K.N. Andersen et al. (Eds.): EGOVIS 2011, LNCS 6866, pp. 344–355, 2011. © Springer-Verlag Berlin Heidelberg 2011

Voice Enabled G2C Applications for M-Government Using Open Source Software

Punyabrata Ghatak¹, Neeraja Atri¹, Mohan Singh², Chandan Kumar Goyal², and Saurabh Banga²

¹ Department of Information Technology, Govt. of India, New Delhi 110003
{pghatak,natri}@mit.gov.in

² Centre for Development of Advanced Computing, Govt. of India, New Delhi 110016
{smohan,chandang,bsaurabh}@cdac.in

Abstract. M-government is the extension of e-government to mobile platforms.

The advancements in mobile communication technology enable a natural

transition from the era of e-government to the era of m-government by

extending the internet from wired PCs to mobile phones. Since speech is the

most natural means of communication, by linking a mobile phone to a

VoiceXML gateway we are able to build voice enabled Government-to-Citizen

(G2C) applications which are accessible ubiquitously by anyone, anytime. Our

implementation of the voice gateway successfully integrates the mobile

telephone network with automatic speech recognition, text to speech synthesis

for English and Hindi, and web navigation systems based on open standards and

using open source software. We describe three voice enabled m-governance

G2C applications on the open source Android platform. The platform specific

m-governance applications can be downloaded directly onto a mobile phone through mobile browsers for use by citizens.

Keywords: Mobile Computing, Open Source Software, Android, VoiceXML,

Automatic Speech Recognition (ASR), Sphinx, Text-to-Speech (TTS), Festival.

1 Introduction

Wireless mobile communication technology has enabled the government to transform

from Electronic Government (e-government) to Mobile Government (m-government).

Governments can reach a greater number of citizens regardless of the country's wired

infrastructure or the citizens' economic, educational or social status. This narrows the digital divide among countries and social layers and significantly benefits both

citizens and the government. By migrating from traditional paper-based and/or wired

internet access based services to the wireless internet, m-government has the potential

to provide citizens with the fastest and most convenient way of obtaining government

services [1]. The number of mobile phone users in India is far greater than the number

of people who use personal computers or the Internet. Wireless mobile

communication technology provides citizens with immediate access to certain

government information and services on an anywhere, anytime basis.



To the ordinary citizen, the basic mobile phone is the only easy-to-use medium for information access. The most common m-government G2C applications include information retrieval and update by various users, as well as the issuing of alerts by governments, mainly through SMS. However, most mobile phones are not suitable for the transmission of complex and voluminous information and do not have features and services equivalent to those of wired internet access devices. The user interface of a mobile device (screen size and keyboard) is still far from ideal, limiting the types of services offered. Also, in India, as in other developing countries with diverse linguistic and cultural groups of citizens, support for different local languages is a crucial issue.

Speech is the most natural means of communication for humans. Also, there is no possibility of a virus being transmitted by a phone call, and a call is typically much more secure. Voice based services on mobile phones in local languages would allow citizens to access government information ubiquitously. However, this requires speech technology to be available in the local languages of the country. Two types of language technology are needed: text to speech (TTS) to deliver information, and automatic speech recognition (ASR) to access it and control its delivery. Of these, TTS is the more essential technology because (i) voice services can manage without ASR through the use of a touch-screen or DTMF keys, and (ii) a single TTS system can cover quite a large region using a neutral dialect.

VoiceXML supports such human-computer dialogs via spoken input and audio

output. VoiceXML is an application of the eXtensible Markup Language (XML)

defined by World Wide Web Consortium (W3C) that defines dialogs between humans

and machines in terms of audio files to be played, text to be spoken, speech to be

recognized, and touch-tone input to be collected [2]. A major advantage of

VoiceXML is that it provides web content over a simple telephone device or a mobile phone, making it possible to access an application even without a computer and an

Internet connection [3]. Comparable to HTML that is interpreted by a Web browser,

VoiceXML is interpreted by a voice browser. Audio input is handled by the voice

browser's speech recognizer. Audio output consists both of recordings and speech

synthesized by the voice browser's text-to-speech system. The voice browser runs on a

specialized voice gateway server that is connected both to the Internet and to the

public switched telephone network (PSTN). The voice gateway connects to the web

servers on the Internet using the HTTP protocol. Thus by using VoiceXML

applications, we can reach out to more users than is possible by using the Internet.

2 Challenges

Although the ultimate goal of providing access to information using voice is to build a

natural language understanding system that understands the query, retrieves

information from the Internet and then extracts the relevant answer from the retrieved

information, such state-of-the-art technology is yet to be developed. However, automatic

speech recognition in a domain specific manner with a finite number of words is

practically feasible.



In mobile communications, background noise is always present and extremely variable. Mobile devices are used every day in a variety of locations and environments. The setting could be an office, an airport, a railway station, an automotive interior or another outdoor location with an acoustically challenging environment. The most demanding situation is non-stationary noise coming from people talking in the background [4]. Also, a certain proportion of mobile users frequently switch from handset to hands-free operation using portable hands-free accessories. This causes large variations in the speech signal in addition to the conventional variation of attenuation from user to user. Increasing background noise degrades the performance of speech recognizers, yet users expect their mobile phones to operate in all possible acoustic environments.

Another technological challenge is the performance degradation of the speech recognizer caused by the low bit-rate codecs used in the PSTN and GSM networks, which becomes more severe in the presence of data transmission errors and background noise. The speech codec is optimized to deliver the best perceptual quality for humans, not to provide the lowest recognition word error rate (WER).

Many websites provide information through dynamic content generation, which may require logging in with a user-id and password and filling in forms on the user's behalf to extract the information. In some cases, the information returned from the website may be too long to read out to the user. Whereas a user can quickly pick out the required piece of information from a visual display, the voice mode requires that the information be either summarized or reduced to specific items such as temperature, humidity, flight status, etc. before being converted to voice [5].

3 e-Governance Using FOSS

We now live in an on-demand society where information is available instantly, whenever and wherever we need it. The Internet has given us this instant access, and central to its success lies the open source culture: the willingness to share information freely. The government can also benefit from adopting an open source culture. It would facilitate mass collaboration and the development of community-based innovation, which can be the pillars of an efficient e-government. Although there are not yet best-practice models to benchmark m-government development, free and open source software (FOSS) provides a viable solution owing to its low-cost models, its ability to employ local talent (leading to the development of local industry), and the availability of various localized distributions.

Localization is one of the areas where FOSS becomes a preferred option for m-governance because of its open nature. The Department of Information Technology, Government of India, has developed a localized version of the GNU/Linux operating system distribution, called Bharat Operating System Solutions (BOSS), with Indian language support and packages relevant for use in the Government domain [6]. Our voice gateway server uses BOSS as the operating system platform, which facilitates interoperability with the other open source components of the system and the deployment of localized applications.




4 Mobile Application Development

The most important issues for mobile application development are fragmentation and

distribution. Developers need to write code for different devices and platforms. Most

of the mobile operating systems like Symbian, Android, BlackBerry, Meego,

Windows Mobile, etc. allow development of native applications for them without

establishing a business relationship with the respective vendor. But the required effort

and the complexity of supporting several native platforms are some of the limitations

that need to be addressed. Some platforms provide restricted access to their software

development kit (SDK), whereas open platforms like Android grant access to all

parts of their SDK and OS.

5 Android Platform

Android is an open source platform that includes an operating system, middleware,

and applications for the development of devices employing wireless communications.

The Android architecture is based on the Linux 2.6 kernel [7]. This provides the basic system

functionality like process management, memory management, network stack,

security, device drivers, etc. On top of the Linux kernel is the set of Android native

libraries. These shared libraries are all written in C or C++, compiled for the

particular hardware architecture used by the phone, and preinstalled by the phone

vendor. Also sitting on top of the kernel is the Android runtime, including the Dalvik

virtual machine and the core Java libraries. The Dalvik VM is Google's

implementation of Java, optimized for mobile devices. It is designed to be instantiated multiple times: each application has its own private copy running in a Linux process.

It is also designed to be very memory efficient, being register based (instead of being

stack-based like Java VM) and using its own bytecode implementation. The Dalvik

VM makes full use of Linux for memory management and multi-threading, which is

intrinsic in the Java language. Situated above the native libraries and runtime is the

Application Framework Layer which provides many higher-level services to

applications in the form of Java classes. At the top of the Android software stack are

applications. Each Android application runs in its own Linux process, an

instantiation of the Dalvik VM, which protects its code and data from other

applications. Android offers a custom plug-in for the Eclipse IDE, called Android

Development Tools (ADT) that is designed to give a powerful, integrated

environment in which to build Android applications. The user needs to define the target configuration by specifying an Android Virtual Device. The code is then

executed on either the host-based emulator or a real device, which is normally

connected via USB.

An Android application may consist of just one activity or it may contain several.

Android applications do not have a single entry point for everything in the

application. The system can instantiate and run any of the essential components which




are activated by asynchronous messages called intents. An intent is an Intent object

that holds the content of the message. It is a passive data structure holding an abstract

description of an action to be performed. The Intent.ACTION_CALL is an intent used

to initiate a phone call from the application program code using the default Telephony

Manager of Android. The telephone number of the PSTN connection to the voice

gateway server is provided in the data field of the Intent.ACTION_CALL object. A

call frame is generated by appending the user-selected language for communication,

which is either English or Hindi, coded as 1 or 0, to the 10-digit PSTN telephone

number. This concatenated string is provided as input data to the

Intent.ACTION_CALL object. When the application program is run on the mobile,

this call frame is automatically dialed and the voice gateway server decodes the call

and returns the necessary information through voice in the chosen language in real

time.
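The call frame construction described above can be sketched in a few lines. The following Python sketch is illustrative only and is not the application code: the function names `build_call_frame` and `split_call_frame`, the example telephone number, and the mapping of English to 1 and Hindi to 0 (read from the order in which the text lists them) are assumptions. On the handset, the resulting string would be supplied as the data of the Intent.ACTION_CALL intent rather than printed.

```python
# Illustrative sketch; function names, the sample number and the
# language-to-digit mapping are assumptions, not taken from the system.
LANGUAGE_CODES = {"english": "1", "hindi": "0"}  # assumed mapping


def build_call_frame(pstn_number: str, language: str) -> str:
    """Append the language code to the 10-digit gateway number."""
    if len(pstn_number) != 10 or not pstn_number.isdigit():
        raise ValueError("expected a 10-digit PSTN number")
    try:
        code = LANGUAGE_CODES[language.lower()]
    except KeyError:
        raise ValueError(f"unsupported language: {language}") from None
    return pstn_number + code


def split_call_frame(frame: str) -> tuple[str, str]:
    """Gateway-side inverse: recover the number and the language name."""
    number, code = frame[:-1], frame[-1]
    names = {digit: name for name, digit in LANGUAGE_CODES.items()}
    return number, names[code]


if __name__ == "__main__":
    frame = build_call_frame("0112345678", "Hindi")  # hypothetical number
    print(frame)
    print(split_call_frame(frame))
```

Keeping the language flag as a single trailing digit means the gateway can recover it without any lookup beyond the last character of the dialed string.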

6 Voice Gateway Server Architecture

The main components of the voice gateway are telephone management, the VoiceXML interpreter, and the speech recognition and synthesis engines. Traditional voice gateway systems are built on top of expensive proprietary voice engines, which in turn are built on expensive proprietary telephony hardware. Using open source software for the gateway components allows the system to be integrated with more flexibility and ensures lower costs. By linking a mobile phone to the VoiceXML gateway, voice enabled mobile applications can be built which are accessible by anyone, anytime (Fig. 1). The W3C VoiceXML 2.0 specification describes the components needed to construct a fully compliant VoiceXML platform [8]. Our gateway uses OpenVXI as the VoiceXML interpreter, Festival as the speech synthesizer for English, Sphinx as the speech recognizer, and Asterisk as the telephony platform [9]. Asterisk is also used for playing audio files and for DTMF recognition. By developing the data fetch engine for extracting contextually relevant information from websites and adding the necessary glue code to these existing open source software packages, we built our Linux based open source voice gateway. To support the Hindi language, a Hindi TTS system is used in place of Festival.

Fig. 1. System Architecture




OpenVXI runs on the Linux platform and is written in C and C++. It uses SpiderMonkey as its JavaScript engine and Xerces as the XML parser, both of which are open source projects available on the Linux platform. The Festival speech synthesis system, developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh, with contributions from CMU, is a Linux based open source framework written in C++ for creating TTS systems [10]. PocketSphinx is an open source speech recognition system developed by CMU [11]. Asterisk is an open source PBX. It runs on the Linux platform and is written in C. The main components of the voice gateway, OpenVXI, Asterisk, Festival and Sphinx, are all mature and active open source projects, which ensures the longevity and reliability of our gateway architecture.

To construct the voice gateway, we first need a means of integrating OpenVXI into Asterisk for routing calls to the VoiceXML interpreter. The VoiceGlue open source project provides a VoiceXML implementation with OpenVXI and Asterisk [12]. Using OpenVXI version 3.4, VoiceGlue can process VoiceXML 2.0 code. VoiceGlue has been integrated with Asterisk through the Asterisk Gateway Interface (AGI), as shown in Fig. 2. Modifications have been made to the Perl code in the file voiceglue_tts_gen inside the /usr/bin/ directory to integrate the Festival TTS server with VoiceGlue. We have also included the necessary code in the voiceglue_tts_gen script so that SSML tags within the VoiceXML document are interpreted by the Festival TTS engine [13].

Fig. 2. VoiceGlue Architecture [12]

PocketSphinx is an open source, large vocabulary, speaker independent continuous speech recognition engine. It depends on the SphinxBase library, which provides common speech decoding functionality across all CMU Sphinx projects. A client-server model is followed for integrating the PocketSphinx speech recognition system with Asterisk [14]. The Asterisk generic speech recognition engine is implemented in the res_speech.so module. This module connects through the generic speech API to speech recognition software. A small plug-in, res_speech_sphinx.c, goes into the Asterisk core and acts as the client. It is used to connect the Speech API calls from the Asterisk dialplan to the speech recognition engine. The speech recognition is done by the server astsphinx.c, which is written using PocketSphinx 0.5.1 and SphinxBase 0.4.1 [15]. To receive speech recognition requests, the server code should be running and listening on the same port as specified in the client plug-in. Thus astsphinx.c, which acts as the server, should be compiled and run in the background alongside the Asterisk system.

The client code res_speech_sphinx.c, added to the Asterisk source code as a plug-in, compiles to form the module res_speech_sphinx.so while building the Asterisk system from source. The source file res_speech_sphinx.c, available as an option in the Asterisk source code in the directory asterisk/res/, is included in the Asterisk core for compilation. This module gets loaded when Asterisk starts. A configuration file, sphinx.conf, is also added to the default Asterisk configuration files in the /etc/asterisk/ directory. This file provides configuration settings for the res_speech_sphinx.so module. The first speech API that is called to start speech recognition is SpeechCreate(Engine Name). The Engine Name parameter refers to sphinx in our case. The acoustic model used for speech recognition is the Communicator semi-continuous model, Communicator_semi_40.cd_semi_6000, for 8 kHz telephone speech. The speech function SpeechLoadGrammar(Grammar Name | Path) loads a grammar, where the parameter Grammar Name refers to the grammar file generated using cmudict and Path refers to the directory where it is stored. An open source Perl program, lmgen.pl, creates grammars for use with the astsphinx server [15]. As input, it requires a copy of cmudict and a simple text file containing the words and phrases to be recognized. Our system uses small vocabularies of up to 100 words. The function SpeechActivateGrammar(Grammar Name) activates the specified grammar to be recognized by the engine. The SpeechStart() API is then called, which tells the speech recognition engine that it should start trying to get results from the audio being fed to it.
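The preparation of the phrase file used for grammar generation can be illustrated with a short sketch. The Python code below is not part of the described system: it assumes that the text file read by lmgen.pl holds one phrase per line, and the helper name `prepare_phrases`, the upper-casing step and the example city names are hypothetical. It only shows how a phrase list might be normalized and checked against the 100-word vocabulary limit mentioned above.

```python
MAX_VOCABULARY = 100  # vocabulary limit stated in the text


def prepare_phrases(phrases):
    """Normalize phrases, drop duplicates and check the word count.

    Returns the cleaned phrase list and the sorted set of distinct words.
    """
    cleaned, seen = [], set()
    for phrase in phrases:
        # Collapse whitespace and upper-case (assumed dictionary convention).
        norm = " ".join(phrase.upper().split())
        if norm and norm not in seen:
            seen.add(norm)
            cleaned.append(norm)
    vocabulary = {word for phrase in cleaned for word in phrase.split()}
    if len(vocabulary) > MAX_VOCABULARY:
        raise ValueError(
            f"{len(vocabulary)} distinct words exceed the limit of {MAX_VOCABULARY}"
        )
    return cleaned, sorted(vocabulary)


if __name__ == "__main__":
    # Hypothetical city list for a weather grammar.
    phrases, words = prepare_phrases(["New Delhi", "mumbai", " new  delhi ", "Kolkata"])
    print("\n".join(phrases))  # one phrase per line, ready to save as a text file
    print(words)
```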

Fig. 3. Block Diagram of Voice Gateway Server

To use any G2C service, the user invokes an application on the mobile phone with

certain options coded as DTMF. The application automatically dials the PSTN

telephone number connected to the voice server. The call lands on Asterisk through

one of the 30 channels of the ISDN PRI (Primary Rate Interface) connection for




PSTN. This connectivity is provided by the Computer Telephony Interface (CTI)

hardware of the voice gateway. Depending on the dialplan settings in Asterisk,

an appropriate prompt is played back to the user requesting spoken input. After the

user speaks the requested information, the astsphinx server recognizes the speech and

returns the result to the Asterisk server. This recognized speech is then passed to

the Asterisk-Java server using AGI [16]. The Asterisk-Java program runs a Java

application by providing a container that receives connections from the Asterisk

server, parses the request and fetches the necessary information from the designated

web server on the Internet. The type of query is either HTTP GET or POST. If the

required information is hosted on the remote server as a Web Service, then the SOAP

protocol is used to fetch the information as an XML file and Java Architecture for

XML Binding (JAXB) is used to extract the desired information from the fetched

XML file. For standard HTML based websites, the wget utility is used to fetch the information. The required data is then extracted from the fetched information. In both

the cases, the extracted information is written into a VoiceXML file in real time. The

Asterisk server then invokes the VoiceGlue server to process this VoiceXML

document using OpenVXI. VoiceGlue internally calls Festival to convert the textual

information into an audio WAV file which is then played back to the user through the

Asterisk PBX.
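The step in which the extracted information is written into a VoiceXML file can be sketched as follows. This Python sketch is an illustration under stated assumptions, not the data fetch engine itself (which the text describes as a Java application): the function name `build_vxml`, the single-prompt document layout and the `en-IN` language tag are hypothetical choices. The namespace is the one defined for VoiceXML 2.0.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # VoiceXML 2.0 namespace


def build_vxml(message: str, lang: str = "en-IN") -> str:
    """Wrap extracted text in a minimal one-prompt VoiceXML 2.0 document."""
    root = ET.Element(
        "vxml", {"version": "2.0", "xmlns": VXML_NS, "xml:lang": lang}
    )
    form = ET.SubElement(root, "form")
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    prompt.text = message  # ElementTree escapes &, < and > when serializing
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical message text produced by the extraction step.
    print(build_vxml("Arrivals & departures are on schedule."))
```

Building the document through an XML library rather than string concatenation keeps the output well-formed even when the fetched text contains characters such as `&` or `<`.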

We have made efforts to customize the Festival TTS engine for better

pronunciation of Indian names. The spelling conventions for Indian names do not follow

the spelling rules for standard English words. Different sets of letter-to-sound rules

must therefore be applied for such names in the dictionary [17]. The Carnegie Mellon

Pronouncing Dictionary (cmudict 0.6) has been used for this purpose. First the

phonetic transcriptions according to Indian English pronunciations of the spelled words were defined in the cmudict.scm file inside the cmu subdirectory in the lib

directory of the Festival distribution. Each line of the cmudict.scm file contains a

spelled word followed by the pronunciation specified by a string of phoneme

symbols. Then the cmudict.scm file is recompiled to produce the cmudict.out file

using the cmu2ft tool. We have also provided SSML support for Festival by creating

appropriate configuration files inside its lib directory.
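The idea of a dictionary line, a spelled word followed by its phoneme symbols, can be made concrete with a small sketch. The Python code below assumes the plain-text layout of the Carnegie Mellon Pronouncing Dictionary (word, whitespace, phoneme symbols, with `;;;` starting a comment line); the Scheme syntax that Festival expects in cmudict.scm is not reproduced here. The helper names and the sample entry are hypothetical.

```python
def parse_entry(line: str):
    """Split 'WORD  PH1 PH2 ...' into (word, [phonemes]).

    Returns None for blank lines and ';;;' comment lines.
    """
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    word, *phonemes = line.split()
    if not phonemes:
        raise ValueError(f"no pronunciation given for {word!r}")
    return word.upper(), phonemes


def load_lexicon(lines):
    """Build a word -> phoneme list mapping; later entries override earlier ones."""
    lexicon = {}
    for line in lines:
        entry = parse_entry(line)
        if entry is not None:
            word, phonemes = entry
            lexicon[word] = phonemes
    return lexicon


if __name__ == "__main__":
    sample = [
        ";;; illustrative entry only",
        "DELHI  D EH1 L IY0",
    ]
    print(load_lexicon(sample))
```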

The Hindi TTS has been developed as a separate project under an initiative of the Department of Information

Technology, Government of India. The TTS has been

integrated with the data fetch engine for delivering audio information in real time. The

TTS system is based on Festival which has been modified to enable UTF-8 input for

Hindi. The TTS has a Mean Opinion Score (MOS) of 3.16 and is domain and

vocabulary independent.

7 Prototype Implementation

Three voice enabled applications have been developed to evaluate the proposed

architecture and design. Our implementation provides useful insights for building a

scaled up system based on open source software. The system at present can handle




Fig. 4. Data Flow Diagram of Vegetable Prices Application

7.2 Weather Update Application

The India Meteorological Department provides current city-wise weather observations through its website www.imd.gov.in. Our prototype application delivers the current weather status of a city on the mobile phone, including weather condition, temperature, and relative humidity. When the user invokes the application on the mobile phone, a screen appears on the display where the user chooses the input option, which is either voice or text. If the user selects the voice input option, the system prompts the user to speak the name of the city whose weather information is to be retrieved. After recognizing the city name, the system retrieves the current weather information from the website and converts it into a voice message, which is then communicated back to the user. If text input is chosen, another screen appears on the display which allows the user to select the name of the city from a list of cities arranged in alphabetical order. The same steps are then repeated, as in the case of voice input, to fetch the required information and deliver it in the form of speech.
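The conversion of the retrieved weather fields into a spoken message can be sketched as below. This Python sketch rests on several assumptions: it supposes the fetched page has already been reduced to a small XML fragment, and the element names, the function name `weather_message` and all sample values are invented for illustration. The actual layout of the source website is not described in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; element names and values are made up.
SAMPLE = """<observation>
  <city>New Delhi</city>
  <condition>Haze</condition>
  <temperature>31</temperature>
  <humidity>62</humidity>
</observation>"""


def weather_message(xml_text: str) -> str:
    """Compose the sentence handed to the TTS engine from the three fields."""
    obs = ET.fromstring(xml_text)
    city = obs.findtext("city", default="the selected city")
    condition = obs.findtext("condition", default="unknown")
    temperature = obs.findtext("temperature", default="unknown")
    humidity = obs.findtext("humidity", default="unknown")
    return (
        f"Current weather in {city}: {condition}. "
        f"Temperature {temperature} degrees Celsius. "
        f"Relative humidity {humidity} percent."
    )


if __name__ == "__main__":
    print(weather_message(SAMPLE))
```

Restricting the message to the three named fields reflects the point made earlier that voice output must carry only the specific information the caller asked for.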

7.3 Flight Status Application

The www.newdelhiairport.in portal provides live flight information for all domestic and international flights arriving at and departing from the Indira Gandhi International Airport, New Delhi. The flight status information provides arrival and departure updates based on flight numbers for different airline carriers. Our application delivers this live flight information on the mobile phone in the form of voice, in both English and Hindi. In this case the input is only in the form of text, which includes entering the flight number using the keyboard, choosing either arrival or departure status, and choosing the language of the output flight status information. After getting these inputs, the application fetches the information from the website as text, converts it into a dialogue and returns the flight status in voice on the mobile phone.

8 Conclusion

We propose a standard architecture and design of an open source voice based service delivery platform for certain types of m-governance applications. Technical details have been provided on how to integrate Sphinx ASR and Festival TTS with OpenVXI and Asterisk PBX to build a voice gateway server. The voice server's data fetch engine, developed by us, connects the World Wide Web to the voice interface. Three simple Android mobile applications have been developed using the platform to demonstrate the benefits of using free and open source software for m-governance. The system also shows the importance of developing open source local language TTS engines, such as Hindi, for m-governance applications. The system is functional, and future enhancement work is aimed at providing support for other Indian languages.

This paper is in part based on research funded by the Department of Information Technology, Government of India, under the project National Resource Centre for Free & Open Source Software (NRCFOSS). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Centre for Development of Advanced Computing (C-DAC) or the Government of India.

References

1. Sheng, H., Trimi, S.: M-government: technologies, applications and challenges. Electronic Government, An International Journal 5(1), 1–18 (2008)

2. Danielsen, P.J.: The Promise of a Voice-Enabled Web. IEEE Computer, 104–106 (August 2000)

3. Singh, K., Park, D.-W.: Economical Global access to a VoiceXML Gateway Using Open Source Technologies. In: Coling 2008: Proceedings of the Workshop on Speech Processing for Safety Critical Translation and Pervasive Applications, Manchester, pp. 17–23 (August 2008)

4. Dobler, S.: Speech recognition technology for mobile phones. Ericsson Review (3), 148–155 (2000)

5. Chauhan, H., Dhoolia, P., Nambiar, U., Verma, A.: WAV: Voice Access to Web Information for Masses. In: W3C Workshop, New Delhi (May 2010)

6. Bharat Operating System Solutions, http://www.bosslinux.in

7. Android, http://developer.android.com

8. W3C, Voice Extensible Markup Language (VoiceXML) Version 2.0, http://www.w3c.org/TR/voicexml20

9. Asterisk - The Open Source Telephony Projects, http://www.asterisk.org

10. The Festival Speech Synthesis System, http://www.cstr.ed.ac.uk/projects/festival/

11. CMU Sphinx Speech Recognition Toolkit, http://cmusphinx.sourceforge.net/2010/03/pocketsphinx-0-6-release/

12. VoiceGlue, http://www.voiceglue.org/

13. W3C, Speech Synthesis Markup Language (SSML), Version 1.0, http://www.w3c.org/TR/speech-synthesis

14. Zaykovskiy, D.: Survey of the Speech recognition Techniques for Mobile Devices. In: SPECOM 2006, St. Petersburg, pp. 88–93 (June 2006)

15. Asterisk Sphinx Speech Recognition Engine Plugin, http://www.scribblej.com/svn/

16. Asterisk-Java, http://asterisk-java.org/

17. Sen, A.: Pronunciation rules for Indian English Text-to-Speech System. In: Workshop on Spoken Language Processing, Mumbai, India, pp. 141–148 (January 2003)
