
FiliText: A Filipino Hands-Free Text Messaging Application

Jerrick Chua, Unisse Chua, Cesar de Padua, Janelle Isis Tan, Mr. Danny Cheng

College of Computer Studies, De La Salle University - Manila

1401 Taft Avenue, Manila (63) 917 5120271

[email protected], [email protected], [email protected], [email protected], [email protected]


ABSTRACT
This research aims to create a hands-free text messaging application capable of recognizing the Filipino language, allowing users to send text messages using speech. Other developers may build on this research to study Filipino speech recognition further and apply it to Filipino text messaging.

Keywords
Speech Recognition, Filipino Language, Text Messaging

1. INTRODUCTION
Texting while driving has been a problem in most countries since the late 2000s. The Philippine National Police (PNP) reported about 15,000 traffic accidents in 2006, an average of 41 accidents per day. Most of these accidents were attributed to driver error, and traffic accidents caused by cellphone use while driving showed the highest increase among the causes of traffic accidents [2]. According to Bonabente (2008), "The Automobile Association Philippines (AAP) has called for an absolute ban on use of mobile phones while driving, saying it was the 12th most common cause of traffic accidents in the country in 2006." The AAP said that using cell phones, even hands-free sets, while driving could impair the driver's attention and lead to accidents.

One existing software application that lets people use voice commands to control their phones is Vlingo, an intelligent voice application capable of much more than allowing users to text while driving [3]. It is also a multiplatform application available for Apple, Android, Nokia, BlackBerry, and Windows Mobile. Another application, VoiceText, was developed by Clemson University. It allows drivers to send text messages while keeping their eyes on the road: drivers put their mobile phones in Bluetooth mode, connect them to their car, and give voice commands to deliver a text message through the car's speaker system or a Bluetooth headset. StartTalking, from AdelaVoice, allows the user to initiate, compose, review, edit, and send a text message entirely by voice command; however, it is only available for Android 2.0 and above [4]. Other applications similar to those mentioned above share the same purpose: to help reduce the number of car accidents caused by distracted driving.

2. SIGNIFICANCE OF RESEARCH
Western countries have started to develop hands-free texting applications that have helped reduce the number of car accidents caused by texting while driving, and some of these applications can understand Chinese. However, the Philippines, considered the text capital of the world, still has no such application that keeps drivers from texting while driving, mainly because none of the existing applications supports the Filipino language.

There are party-lists and organizations that support legislation of this kind. The Buhay party-list filed a bill seeking to penalize persons caught using their mobile phones on the road. The cities of Manila, Makati, and Cebu have banned the practice on paper, but the ban has not been properly enforced [1]. Even where a law bans Filipinos from using their mobile phones while driving, it has not been strictly implemented, and there are only a few ways of knowing whether a person is actually following it. The development of a local version of an existing hands-free text messaging application, Vlingo InCar, offers an alternative for Filipinos. Such a service aims to keep the driver's hands on the wheel and his or her attention on the road. The Philippines, known as the texting capital of the world, may see fewer traffic accidents in the future if drivers can use their mobile phones hands-free on the road rather than having to glance at and read a text message from one of their contacts.

Hands-free text messaging is not only helpful in keeping drivers from using their hands to text while driving; it may also be used by physically disabled individuals with normal speech, by people who are used to multitasking, or in situations where a person needs both hands to perform an activity. One application would be for busy business people who need to do many things in a short amount of time. Such an application would help in their daily routine because they would no longer need their hands to send an urgent message to colleagues while attending to other urgent matters.

Alongside this application, this research will shed light on speech recognition of the Filipino language, a topic that lacks research and in-depth analysis. Deeper studies in this area could serve as a stepping stone for future work involving Filipino speech recognition.


3. RELATED LITERATURE

3.1 Filipino Text Messaging Language
Manila's lingua franca was used as the base for the Philippines' national language, Filipino, which is commonly used in the urban areas of the country and is spreading fast across it [6]. Tagalog is the structural base of the Filipino language and has traditionally been spoken in Manila and in provinces such as Rizal, Cavite, and Laguna.

According to a study by Shane Snow, the Philippines is still considered the text capital of the world. With the Short Message Service (SMS) constraint of 160 characters per message, people learned how to shorten what they wanted to say, a practice now referred to as 'text speak'. One simple way of shortening a message is to take out all the vowels; however, this does not work for some words because it creates ambiguity between words with similar consonant sequences. Phonetics, or how a word sounds, also plays a role in how Filipino texters shorten messages [7]. For example, 'Dito na kami' becomes 'D2 n kmi' and 'Kumain na ako' becomes 'Kumain n ko'.
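As an illustration (not taken from the paper), a minimal rule-based shortener in Java could combine a small phonetic substitution table with vowel dropping; the substitution entries below (e.g. "dito" to "d2") are hypothetical examples of the conventions described above, not FiliText's actual rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of rule-based "text speak" shortening; the substitution
// table is hypothetical and only illustrates the conventions described above.
public class TextSpeak {
    private static final Map<String, String> SUBS = new LinkedHashMap<>();
    static {
        SUBS.put("dito", "d2");   // phonetic shortening: "di-to" -> "d2"
        SUBS.put("ako", "ko");    // common contraction
    }

    public static String shorten(String sentence) {
        StringBuilder out = new StringBuilder();
        for (String word : sentence.trim().toLowerCase().split("\\s+")) {
            String mapped = SUBS.get(word);
            if (mapped == null) {
                // Fallback rule: keep the first letter, drop vowels from the rest.
                mapped = word.charAt(0) + word.substring(1).replaceAll("[aeiou]", "");
            }
            if (out.length() > 0) out.append(' ');
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(shorten("Dito na kami"));   // prints "d2 n km"
    }
}
```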

3.2 Speech to Text Libraries
Speech-to-text systems are already available as desktop applications, and some of them provide APIs and/or libraries for those who want to build new applications on top of them. Among these systems are CMUSphinx, Android Speech Input, the Java Speech API, and SpinVox Create. Of the available APIs and libraries, CMUSphinx is the most appropriate. CMUSphinx distributes the CMUSphinx Toolkit, which comes with various tools for building speech applications, including a recognizer library, a support library, language model and acoustic model training tools, and a decoder for speech recognition research. It also has a library for mobile support called PocketSphinx. CMUSphinx can also generate its own pronunciation dictionary from an existing dictionary, but the pronunciation generation code only supports English and Mandarin.

3.3 CMU Sphinx-4
Sphinx-4 is a Java-based, open source automatic speech recognition system [8]. It is made up of three core modules: the FrontEnd, the Linguist, and the Decoder. The Decoder is the central module; it takes in the output of the FrontEnd and the Linguist, generates its results, and passes them to the calling application. The Decoder has a single sub-module, the SearchManager, which it uses to recognize a set of frames of features. The SearchManager is not limited to any single search algorithm, and its capabilities are further extended by the design of the FrontEnd module.

The FrontEnd module is responsible for digital signal processing. It takes in one or more input signals and parameterizes them into features, which it then passes to the Decoder module.

Finally, there is the Linguist module, which is responsible for generating the SearchGraph. The Decoder compares the features from the FrontEnd against this SearchGraph to generate its results. The Linguist is made up of three sub-modules: the AcousticModel, the Dictionary, and the LanguageModel. The AcousticModel is responsible for the mapping between units of speech and their respective hidden Markov models. The LanguageModel provides implementations that represent the word-level language structure. The Dictionary dictates how words in the LanguageModel are pronounced.

CMU Sphinx can also use a language model created by a user. With such a language model, users can create a grammar suitable for their own language and, with the help of a matching acoustic model, fully utilize a language model patterned after their native language.
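To make the interplay of these modules concrete, the sketch below shows the typical way a Sphinx-4 recognizer is driven from Java: a ConfigurationManager loads an XML configuration that wires up the FrontEnd, Linguist, and Decoder, and the application then asks the Recognizer for results. The configuration file name here is an assumption for illustration, not the configuration actually used in this paper.

```java
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

// Minimal sketch of driving Sphinx-4 from Java; "filitext.config.xml" is a
// hypothetical configuration that would point to the Filipino acoustic model,
// dictionary, and language model described in this paper.
public class RecognizeOnce {
    public static void main(String[] args) throws Exception {
        ConfigurationManager cm = new ConfigurationManager(
                RecognizeOnce.class.getResource("filitext.config.xml"));

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();                      // loads models and builds the SearchGraph

        Result result = recognizer.recognize();     // decodes one utterance from the FrontEnd
        if (result != null) {
            System.out.println(result.getBestFinalResultNoFiller());
        }
        recognizer.deallocate();
    }
}
```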

4. SYSTEM DESIGN

4.1 Overview
FiliText is a desktop application designed especially for Filipinos. It serves as a stepping stone for future developers to create a hands-free texting application for mobile phones. FiliText accepts audio files, specifically in the Waveform Audio File Format (WAV), as input and processes them through a speech recognition API to convert the message into text. The conversion is performed after the user acknowledges that the voice input is complete. The application produces two outputs. The first is the converted message with proper and complete spelling in Filipino. As an option, the user may choose to compress the text output into an SMS, since most cell phone carriers allow only up to 160 characters per message.

4.2 Architecture
The system begins by gathering input, a spoken message, through the input module. The input module passes the unprocessed spoken message to the Sphinx-4 module, which is configured to recognize Filipino. Sphinx-4 then passes the now text-based message to the message shrinking module, which applies common methods of reducing word length. Finally, the shrunken, text-based message is passed to the output module, which displays it to the user.

Figure 2. Architectural Design of FiliText
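A minimal sketch of this pipeline as Java interfaces is shown below; all names are illustrative, not FiliText's actual class names.

```java
// Illustrative module boundaries for the pipeline described above;
// every name here is hypothetical, not an actual FiliText class.
interface InputModule     { byte[] captureSpokenMessage(); }   // e.g. reads a WAV file
interface SpeechToText    { String recognize(byte[] audio); }  // wraps the Sphinx-4 module
interface MessageShrinker { String shrink(String text); }      // applies text-speak rules
interface OutputModule    { void display(String message); }

final class FiliTextPipeline {
    private final InputModule input;
    private final SpeechToText recognizer;
    private final MessageShrinker shrinker;
    private final OutputModule output;

    FiliTextPipeline(InputModule in, SpeechToText stt, MessageShrinker sh, OutputModule out) {
        this.input = in; this.recognizer = stt; this.shrinker = sh; this.output = out;
    }

    void run() {
        byte[] audio = input.captureSpokenMessage();
        String text = recognizer.recognize(audio);   // spoken message -> Filipino text
        String sms  = shrinker.shrink(text);         // Filipino text -> "text speak"
        output.display(sms);
    }
}
```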

The system relies on Sphinx-4 to convert the spoken message into its respective text format. Because Sphinx-4 is highly configurable, a speech recognition module does not need to be coded from scratch; instead, the Sphinx-4 module is trained and configured to recognize informal Filipino.

The input will first pass through the FrontEnd module, which will handle the cleaning and normalizing of the input. Little effort will be placed into configuring and optimizing the FrontEnd as it deals with digital signal processing.

The Linguist module will create the SearchGraph, which the Decoder module will use to compare the input against in order to generate its results. The Linguist module will be the most configured of the three, as it contains the hidden Markov models, the list of possible words and their respective pronunciations, and the acoustic models of phonemes. Sphinx-4 does not come with any of the files necessary to understand Filipino, so the dictionary, the acoustic models, and the language models will be created by the proponents using the tools provided by the CMU Sphinx group.

The language model will be created using a compilation of Filipino text messages, newscast transcripts, Facebook posts, and Twitter feeds. These will be placed into a text file in the format <S> text n </S>. A vocabulary file, a text file listing all Filipino words used, will also be created and used to generate the language model; it will not include names and numbers. These two files will be used by the CMU-Cambridge Language Modeling Toolkit (CMUCLMTK) to create an n-gram language model. Aside from being used in the implementation, the created language model is also needed for the creation of the acoustic model.
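As a sketch of this corpus-preparation step (the file names and sentences below are placeholders, and lowercase <s> ... </s> delimiters are used, as is common for CMUCLMTK input), the gathered sentences could be written out together with a vocabulary file listing every distinct word:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.TreeSet;

// Sketch of preparing the two text files fed to CMUCLMTK; the sentences and
// file names are placeholders, not the actual FiliText corpus.
public class PrepareCorpus {
    public static void main(String[] args) throws IOException {
        List<String> sentences = List.of("kumain na ako", "dito na kami");

        StringBuilder corpus = new StringBuilder();
        TreeSet<String> vocabulary = new TreeSet<>();
        for (String s : sentences) {
            corpus.append("<s> ").append(s).append(" </s>\n");   // sentence-delimited corpus line
            for (String w : s.split("\\s+")) vocabulary.add(w);  // collect distinct words
        }

        Files.writeString(Paths.get("filipino.corpus"), corpus.toString());
        Files.writeString(Paths.get("filipino.vocab"), String.join("\n", vocabulary) + "\n");
    }
}
```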

The AcousticModel sub-module of the Linguist will be trained using SphinxTrain, a tool also created by the CMU Sphinx group for generating the acoustic models a system will use. To train the acoustic model, recordings from different speakers will be compiled and each audio file will be transcribed. Each recording will be a WAV file sampled at 16 kHz, 16-bit mono, and segmented to be between 5 and 30 seconds long [5]. The set of speakers will include males and females 16 years of age or older.

The Decoder module compares the processed input against the SearchGraph produced by the Linguist to produce its results. The Decoder's SearchManager sub-module will be configured to use the provided SimpleBreadthFirstSearchManager, an implementation of the frame-synchronous Viterbi algorithm.
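For reference, the frame-synchronous Viterbi recursion scores the best path into each HMM state at every frame; this is the standard textbook form, not a formula taken from the paper.

```latex
% Standard Viterbi recursion (not taken from the paper):
% \delta_t(j) is the score of the best state sequence ending in state j at frame t,
% a_{ij} the transition probability, and b_j(o_t) the acoustic likelihood of frame o_t.
\delta_t(j) = \max_i \big[ \delta_{t-1}(i)\, a_{ij} \big]\, b_j(o_t)
```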

The message shrinker module will use word alignment, a statistical machine translation technique, to shorten the output of the Sphinx-4 module while keeping it understandable. The output of this module will be a text of at most 160 characters, unless there is no possible way to make the text shorter than 160 characters. This shortened message is then sent to the output module, which displays it to the user.

4.3 Customization of Sphinx-4
Since the Sphinx-4 documentation specifies the steps for creating the language model and acoustic model of a new language, it was relatively easy to create a prototype. The challenge in customizing Sphinx-4 for an entirely different language is gathering all the recordings and training them with SphinxTrain.

The initial task was to gather audio recordings of different speakers that together use all the phonemes of Filipino. The audio files must be 16-bit mono at 16 kHz and must not be shorter than 5 seconds, to aid the accuracy of acoustic model training. After all the speech recordings are gathered, they are placed in a folder and run through CMUCLMTK so that it can create the dictionary of used words and list the phonemes it was able to detect. After running the language modeling toolkit, the data is ready for training under SphinxTrain, which creates the acoustic model the application uses to understand the words uttered by the end user.

4.4 Data Collection
The corpus the proponents used is the Filipino Speech Corpus (FSC), created by Guevarra, Co, Espina, Garcia, Tan, Ensomo, and Sagum of the University of the Philippines - Diliman Digital Signal Processing Laboratory in 2002. The corpus contains audio files and matching transcription files, which were used to train the acoustic model and the language model for the Filipino language; the trained models were then used for recognition. For the language model, contemporary sentences were also gathered from social networking sites such as Facebook and Twitter.

4.5 Training the Acoustic and Language Models
To create a new acoustic model for Sphinx-4 to use, the CMU Sphinx project provides tools that aid in creating the models needed for recognizing speech. The required files are placed in a single folder, which is treated as the speech database.

Sound recordings will be placed into the directory wav/speaker_n. The dictionary, the mapping of words to their respective pronunciations, that will be used for SphinxTrain will be the same as the one used in the implementation of Sphinx-4. A single text file will be created to house the transcriptions of each recording; each entry must follow the format <S>transcription of file n</S> (file_n). A file with the filenames of all the sound recordings must also be present, as this will be used to map the files to the transcriptions. The ARPA language model will also be used by SphinxTrain to generate the acoustic models. A phoneset file, a file listing each phoneme tag, will be provided as well. The filler dictionary will be created to include only silences. These will all be used by SphinxTrain to generate the acoustic model. All of the mentioned files other than the audio recordings will be placed inside a folder labeled etc.

Because the FSC recordings are 25 to 35 minutes long each, the proponents had to segment each file into the specified 5 to 30-second lengths. They were able to automate the process using SoX, an open source audio manipulation tool, which segmented the sound files according to the existing transcriptions that came with the speech corpus. After segmenting the sound files, the filenames were written to a fileids file and the transcriptions of each sound file were compiled into a single file, ready for training.
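As a sketch of how such segmentation could be automated from Java by invoking SoX's trim effect (the paths, file names, and time offsets below are illustrative assumptions, not the actual FSC segmentation script):

```java
import java.io.IOException;

// Sketch of cutting one segment out of a long FSC recording with SoX's
// "trim" effect; all paths and times here are illustrative placeholders.
public class SegmentWithSox {
    static void cutSegment(String inWav, String outWav,
                           double startSeconds, double durationSeconds)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "sox", inWav, outWav,
                "trim", String.valueOf(startSeconds), String.valueOf(durationSeconds))
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("sox failed for " + outWav);
        }
    }

    public static void main(String[] args) throws Exception {
        // Cut a 12-second utterance starting 65 seconds into the recording.
        cutSegment("speaker_01_full.wav", "speaker_01_utt001.wav", 65.0, 12.0);
    }
}
```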


The language model needed for the etc folder was created using the transcription file and CMUCLMTK. The phonetic dictionary was also created with the aid of the language modeling toolkit because, in the process of creating the language model itself, the toolkit first creates a dictionary file containing all the words in the transcription file.

The phonetic dictionary the proponents created uses the letters of each Filipino word as the phones for that word. According to the acoustic model training documentation of SphinxTrain, this approach is used when no phoneme book is available, and it "gives very good results" [5].

Figure 3. Phonetic Dictionary Sample
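To illustrate the letter-as-phone scheme just described (the figure itself is not reproduced here, and the words below are placeholders rather than actual FiliText dictionary entries), each entry simply lists the word followed by its letters as phones, which is straightforward to generate:

```java
import java.util.List;

// Sketch of generating letter-as-phone dictionary entries, e.g.
//   KUMAIN  K U M A I N
// The word list is a placeholder, not the actual FiliText vocabulary.
public class LetterPhoneDictionary {
    static String entry(String word) {
        String upper = word.toUpperCase();
        return upper + "\t" + String.join(" ", upper.split(""));
    }

    public static void main(String[] args) {
        for (String w : List.of("kumain", "dito", "ako")) {
            System.out.println(entry(w));
        }
    }
}
```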

The folder structure must be followed because the training process is controlled by Perl scripts that set up the rest of the training binaries and configuration files.

Before starting the training, the training configuration file (sphinx_train.cfg) must be edited according to the size of the speech database to be trained. The model parameters that must be taken into consideration before training are the number of tied states (senones) and the density.

Figure 4. Approximation of Senones and Number of Densities

The training internals include, but are not limited to, computing the features of the audio files, training the context-independent models and the context-dependent untied models, building decision trees, pruning the decision trees, and finally training the context-dependent tied models.

4.6 Mobile Application
A desktop application is very different from a mobile application because of the limitations of mobile devices in storage capacity, processing speed, and more. When moving the application to a mobile device, the whole application should be smaller while still performing similarly to the desktop version. This is a challenge since the application needs the acoustic model, the language model, and the other linguistic models required to recognize the spoken text.

Sphinx-4 also has a mobile counterpart called PocketSphinx, which is usually used for mobile applications that require speech recognition. It has been used to develop applications for the Apple iPhone before [5].

5. TEST RESULTS
The proponents trained three sets of acoustic and language models using 20, 40, and 60 speakers from the FSC. The training was split this way to see whether accuracy would improve as the training data increased. The proponents conducted two types of testing: controlled and uncontrolled. Controlled testing made use of the existing recordings from the speech corpus, while uncontrolled testing was done with random people who were not in the corpus.

To determine the accuracy of the system, the text generated is compared to the correct transcription of the recording. The following formula is used to obtain the accuracy rate of the system:

Accuracy Rate = (matching_words / total_words_in_transcription) × 100
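A direct reading of this formula in Java is sketched below; position-wise word matching is an assumption, since the paper does not specify how recognized words are aligned with the transcription.

```java
// Sketch of the accuracy-rate formula above; position-wise matching is an
// assumption, as the paper does not specify how words are aligned.
public class AccuracyRate {
    static double accuracy(String recognized, String transcription) {
        String[] hyp = recognized.trim().toLowerCase().split("\\s+");
        String[] ref = transcription.trim().toLowerCase().split("\\s+");
        int matching = 0;
        for (int i = 0; i < Math.min(hyp.length, ref.length); i++) {
            if (hyp[i].equals(ref[i])) matching++;
        }
        return 100.0 * matching / ref.length;
    }

    public static void main(String[] args) {
        System.out.println(accuracy("pauwi kalaban", "pauwi ka na ba")); // 25.0
    }
}
```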

Controlled testing was done for all three trained sets, and the accuracy for each speaker per set is shown in Figure 5.

Figure 5. Accuracy Comparison for 3 Sets

The accuracy for the 40-speaker set dropped, but this was because it lacked training. For the 60-speaker set, the training variables were adjusted to fit the size of the training data, which in turn gave better results than the 20-speaker set. Figure 6 shows the average accuracy rate for each set in the controlled testing; the accuracy of the 60-speaker set clearly increased compared to the 20-speaker set. The mean accuracy rates in controlled testing were 45% for the 20-speaker set, 43.25% for the 40-speaker set, and 58.32% for the 60-speaker set.

Figure 6. Average Accuracy Rate for Controlled Test


For the uncontrolled testing, the proponents built the language model from the transcriptions from UP Diliman and from contemporary sentences gathered from different social networking sites. The proponents gathered 20 speakers, 10 male and 10 female, to test the system with sentences that are usually used in daily conversations. Using the data trained on the 120-speaker set and the new language model, the system attained an average accuracy of 69.67% and an error rate of 30.33%.

Figure 7. Accuracy and Error Rate for Uncontrolled Test – Male

Figure 8. Accuracy and Error Rate for Uncontrolled Test – Female

For better results, speakers are recommended to speak in a clear, loud voice and to avoid mispronouncing words. The speaker should also speak at a slower pace so that each word is more distinct from the others and two different words do not run together. The accuracy of the system also drops when too much background noise is present. Table 1 shows actual results produced by the system:

Table 1. Sample Output

Expected Output              Actual Output
Nandito ako                  Nandito ako
Maganda ba yun palabas       Maganda bayong palabas
Tumatawag na naman siya      Tumatawag nanaman siya
Pauwi ka na ba               Pauwi kalaban

5.1 Creating the Mobile Application
Attempting to port the existing desktop speech recognition application to an Android device proved to be a challenge. Since the mobile version of Sphinx-4, PocketSphinx, is not yet well documented for Android, the proponents had a hard time installing the required software and creating the application for the mobile device. Some sample applications were available online at a demo level, but they were tricky to install on the mobile device.

Another challenge was the limitations of the phones the proponents had. The demo application that was downloaded and modified was too heavy for the HTC Hero: when the application was opened, it would close itself without any warning. Another mobile phone available for testing was a Samsung Galaxy Ace; however, the proponents have yet to test on that device.

5.2 Improving Performance
As mentioned above, the accuracy for sentences not found in the language model was very low. The proponents are continuing research on how to improve the performance of the system. There are two approaches the proponents will pursue: building a new language model from sentences consisting of everyday conversational Filipino, and training the acoustic model to be phoneme dependent.

The new language model will be built with the help of the Department of Filipino of De La Salle University. The department will advise the researchers on which sentences are considered conversational Filipino so that these can be included in the language model. An additional resource for sentences that could be added to the language model is a collection of existing text messages sent in Filipino. After this language model is completed, the system will be retrained on the new language model and tested to see whether improvement occurred.

Training the acoustic model to be phoneme dependent will allow the system to use letter-to-sound rules to guess the pronunciation of unknown words. These unknown words are not found in the dictionary or the transcription files, which means the system was not trained to understand them. The letter-to-sound rules attempt to guess the pronunciation of unknown words based on the existing phonemes and words in the dictionary. Again, testing will be conducted on the different sets of trained models to see whether improvement occurs as the training data is incremented. The test results will also be compared to the existing test results to see whether unknown words are actually recognized.


6. CHALLENGES ENCOUNTERED
Sphinx-4 is a very flexible system capable of performing many types of recognition tasks, and there is a lot of documentation available to the public. However, since the tool was not made specifically for the Filipino language, many modifications had to be made.

The demonstration programs provided with Sphinx-4 showed low accuracy. This was due to the noise and echo picked up during testing. The challenge was remedied by switching off noise reduction and acoustic echo cancellation in the microphone's settings.

The Sphinx documentation also specifies that the recorded WAV files should be 16-bit, 16 kHz mono to be used for training. However, the first set of recordings did not follow these specifications, and changing the sample rate of a given sound file only resulted in it being slowed down. This issue was resolved by changing the sample rate of the recording project itself instead of the files.

Although Sphinx-4 has been built and tested on the Solaris Operating Environment, Mac OS X, Linux, and Win32 operating systems, CMUCLMTK requires UNIX. This was remedied by using Cygwin, as recommended by the Sphinx team. Line breaks in the tool's input also need to be in UNIX format; this issue was resolved by switching to UNIX line breaks using Notepad++.

The recorded messages used in training the system had little background noise. When the application is used in a normal environment, which has more noise and echo, the system's accuracy could drop.

The issues in creating a hands-free texting application also involve the users' styles of texting and speaking. The type of keypad a phone has, whether QWERTY or T9, is a factor in how users type their text messages. There is also a difference between how a person composes a text and the way he or she speaks in a conversation.

Another issue arising from the results is the lack of relevance of the trained language model to the messages people actually send. Because of the low accuracy rate for the uncontrolled tests, the proponents believe that the language model is the main contributor to the drop in accuracy. The language model was patterned on the speech uttered by the speakers in the FSC, and these recordings include stories and words. These are not sentences that are often used in everyday texting, which is why the sentences uttered by the speakers in the uncontrolled test were barely recognized by the system.

7. CONCLUSION
CMU Sphinx-4 is an effective tool for developing a desktop application that can recognize speech in the Filipino language and produce its text equivalent. The system is also able to apply simple rule-based text shortening using regular expressions to provide users with the 'text speak' equivalent of the output produced.

There are also several recommendations that future developers may take up to improve the system. First, increase the data in the language model; this data may include English words, since most Filipinos do not text in plain Tagalog but mix English and Tagalog. Developers may also allow the user to place punctuation marks in sentences for better readability of the result. Other commands, such as starting and ending the recording for speech recognition, may also be added as feature enhancements. Lastly, it is recommended that the application be ported to mobile devices with different operating systems such as Android and iOS.

8. ACKNOWLEDGMENTS
The researchers would like to thank the following: (1) Mr. Danny Cheng, for being an adviser and guiding the group throughout the research; (2) the entire panel, namely Mr. Allan Borra and Mr. Clement Ong, for the remarks and suggestions they gave to further improve the research; and, last but not least, (3) the EEE department of the University of the Philippines – Diliman, for allowing us to use their Filipino Speech Corpus (FSC) for our research.

9. REFERENCES
[1] Bonabente, C. L. (2008). Total ban sought on cell phone use while driving. Retrieved from http://newsinfo.inquirer.net/breakingnews/metro/view/20080920-
[2] CarAccidents. (2010). Philippines Crash Accidents, Driving, Car, Manila Auto Crashes Pictures, Statistics, Info. Retrieved from http://www.car-accidents.com/country-car-accidents/philippines-car-accidents.html
[3] Vlingo Incorporated. (2010). Voice to text applications powered by intelligent voice recognition. Retrieved from http://www.vlingo.com
[4] AdelaVoice Corporation. (2010). StartTalking. Retrieved from http://www.adelavoice.com/starttalking.php
[5] CMU Sphinx. (2010). CMU Sphinx – Speech Recognition Toolkit. Retrieved from http://cmusphinx.sourceforge.net
[6] Gonzalez, A. (1998). The Language Planning Situation in the Philippines. Journal of Multilingual and Multicultural Development, 55, 5-6.
[7] BBC h2g2. (2002). Writing Text Messages. Retrieved from http://www.bbc.co.uk/dna/h2g2/A70091
[8] CMU Sphinx Group. (2011). CMU Sphinx. Retrieved from http://cmusphinx.sourceforge.net/sphinx4/
[9] Cu, J., Ilao, J., Ong, E. (2010). O-COCOSDA 2010 Philippine Country Report. Retrieved from http://desceco.org/O-COCOSDA2010/o-cocosda2010-abstract.pdf
