© 2014, IJARCSSE All Rights Reserved Page | 531
Volume 4, Issue 5, May 2014 ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com
Development of Speech Database for Hindi Text-To-Speech System Considering Syllable as a Basic Unit
Arun Kumar C*, Dept. of ECE, SJCE Mysore, Karnataka, India
Shreekanth T, Dept. of ECE, SJCE Mysore, Karnataka, India
Udayashankara V, Dept. of IT, SJCE Mysore, Karnataka, India
Abstract: The objective of a text-to-speech (TTS) system is to convert orthographic text into intelligible and natural-sounding speech. The choice of unit plays a vital role in achieving this. Phonemes, diphones, allophones and syllables are the basic units a speech system can use. Taking the phoneme as the basic unit of a concatenation-based TTS system results in a large number of concatenation points, which lowers the quality of the speech output. Taking the syllable as the basic unit for database building results in fewer concatenation points and higher-quality speech output. This work therefore describes the construction of the standard text database required to build a syllable-level speech database that takes into account the position of the syllable in a word, i.e. Start, Middle and End. The database consists of 1326 standard and non-standard words, covering 442 syllables in each of the Start, Middle and End positions.
Keywords: Speech synthesis, Concatenative synthesis, Text processing, Speech generation, Hindi TTS system.
I. INTRODUCTION
The ultimate goal of Text-To-Speech (TTS) synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech [2]. This generally involves two steps:
1. Text processing.
2. Speech generation.
The objective of the text processing component is to process the given input text and produce an appropriate sequence of phonemic and syllabic units. These units are realized by the speech generation component, either by synthesis from parameters or by selection of a unit from a large speech corpus [3]. For natural-sounding speech synthesis, it is essential that the text processing component produce an appropriate sequence of syllabic units corresponding to an arbitrary input text [4].
Phonemes, diphones, allophones and syllables are the basic units of speech. The phoneme is the smallest sub-unit used in a speech synthesis system; no other letters can modify its sound.
A syllable is a cluster of consonants and vowels. A syllable contains exactly one vowel and any number of consonants:
1. A single vowel can act as a syllable (i.e. V).
2. Possible syllable patterns include V, C*V, V*C, C*V*C, C*C*V, C*C*C*V*C*C*C, etc.
3. A consonant before the vowel is called the 'Onset', e.g. the C in C*V.
4. A consonant after the vowel is called the 'Coda', e.g. the C in V*C.
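The syllable patterns listed above can be checked mechanically. The following is a minimal sketch, assuming a syllable shape is written as a plain string of the letters C and V (e.g. "CCV" for C*C*V); the function names are illustrative and do not come from the paper.

```python
import re

# Exactly one vowel (V) with any number of consonants (C) on either side.
SYLLABLE_SHAPE = re.compile(r"C*VC*")


def is_syllable_shape(shape: str) -> bool:
    """Return True if `shape` contains only C/V symbols and exactly one V."""
    return SYLLABLE_SHAPE.fullmatch(shape) is not None


def onset_and_coda(shape: str) -> tuple[int, int]:
    """Return (number of onset consonants, number of coda consonants)."""
    if not is_syllable_shape(shape):
        raise ValueError(f"not a single-vowel syllable shape: {shape!r}")
    vowel_index = shape.index("V")
    return vowel_index, len(shape) - vowel_index - 1


if __name__ == "__main__":
    for shape in ("V", "CV", "VC", "CVC", "CCV", "CCCVCCC", "CVCV"):
        print(shape, is_syllable_shape(shape))
```

A shape such as "CVCV" is rejected because it contains two vowels and therefore spans two syllables.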
The databases developed for Text to Speech synthesis systems generally use phonemes or syllables as the basic concatenative unit. Such databases are built or collected from LDCIL and have been used by many researchers for continuous speech synthesis and recognition systems. Most of this work has been carried out for Chinese, Punjabi and English; little has been done for other Indian languages. Table II shows various databases built by researchers for TTS systems.
A speech database has been developed at Mysore for a Text to Speech synthesis system in the Kannada language. The basic unit selected for speech synthesis in that project was the phoneme, and the database consists of a total of 1,605 phonemes. The phonemes were recorded using the PRAAT utility on the Windows operating system, at a sampling frequency of 16,000 Hz, with a standard microphone in the lab. The recorded phonemes include vowels, semivowels, stops, fricatives, nasals, etc. [1].
A Punjabi speech database has been developed for a Text to Speech synthesis system at the Department of Computer Science, Punjabi University, Patiala. The researchers selected the syllable as the basic unit of concatenation. The database consists of 3,312 syllables, which account for more than 99% of the cumulative percentage frequency in the selected corpus. These syllables were selected after analyzing all possible syllables of a Punjabi corpus containing more than four million words, of which nearly 233,009 were unique; 9,317 valid syllables were found, from which 3,312 were selected. The selected syllables were recorded from a speaker using a standard microphone in a studio environment [10].
Arun et al., International Journal of Advanced Research in Computer Science and Software Engineering 4(5),
May - 2014, pp. 531-549
A Text to Speech synthesis system for four Indian languages, Hindi, Odiya, Bengali and Telugu, has been developed at the Department of Computer Science and Application, Utkal University, Bhubaneswar. To develop the speech corpora, native speakers were found for each of the four languages and asked to read the text in a laboratory environment without background noise. The system uses concatenation of syllables as the basis of its speech database [11].
The following section describes the syllable rules involved in word segmentation and in concatenation-based Text to Speech synthesis.
A. Syllable Rules
1. When a nasal such as /n’/ (a half-pronounced /m/ or /n/ sound) immediately follows a vowel, it is treated as part of the vowel and of the same syllable. For example, /n’/ in san’sthaa is part of the syllable containing /sa/ [10].
2. When there are three or more consonants between two consecutive vowels, the first consonant becomes part of the coda of the previous syllable, while the remaining consonants form the onset of the next syllable [10].
E.g. a b c d e, where a and e stand for vowels and b, c and d for consonants:
/ab/ = first syllable, with coda b (V*C)
/cde/ = second syllable, with onset cd (C*C*V)
3. When there are exactly two consonants between two vowels, the first consonant becomes part of the coda of the previous syllable and the second becomes the onset of the next syllable [10].
E.g. a m m a (vowel, consonant, consonant, vowel):
/am/ = first syllable, with coda m (V*C)
/ma/ = second syllable, with onset m (C*V)
4. When the second of the two consonants is a member of the set {/r/, /s/, /sh/, /shh/}, both consonants become part of the onset of the next syllable [10].
E.g. y a a t r a:
/yaa/ = syllable 1
/tra/ = syllable 2
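The four rules above can be sketched as a small syllabifier. This is a minimal illustration, not the authors' implementation: it assumes the word is already split into romanized phoneme tokens, that the vowel inventory is the small set listed in the code, that the nasal is written with an ASCII apostrophe as "n'", and that a single consonant between two vowels (a case the rules do not mention) goes to the onset of the next syllable.

```python
# Assumed romanized vowel tokens; the paper does not fix a transliteration scheme.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}
NASAL = "n'"                                   # rule 1 nasal, ASCII spelling assumed
ONSET_SECOND = {"r", "s", "sh", "shh"}         # rule 4 set


def coda_length(cluster):
    """Number of leading consonants of `cluster` kept by the previous syllable."""
    if not cluster:
        return 0
    if cluster[0] == NASAL:            # rule 1: nasal stays with the preceding vowel
        return 1
    if len(cluster) >= 3:              # rule 2: first consonant is coda, rest is onset
        return 1
    if len(cluster) == 2:
        if cluster[1] in ONSET_SECOND:  # rule 4: both consonants go to the onset
            return 0
        return 1                        # rule 3: split one and one
    return 0                            # single consonant: assumed to be an onset


def syllabify(phones):
    """Split a list of phoneme tokens into syllables (lists of tokens)."""
    vowel_positions = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_positions:
        return [list(phones)] if phones else []
    syllables, start = [], 0
    for k, v in enumerate(vowel_positions):
        if k + 1 < len(vowel_positions):
            cluster = phones[v + 1:vowel_positions[k + 1]]
            end = v + 1 + coda_length(cluster)
        else:
            end = len(phones)          # trailing consonants join the last syllable
        syllables.append(list(phones[start:end]))
        start = end
    return syllables


if __name__ == "__main__":
    print(syllabify(["y", "aa", "t", "r", "a"]))
    print(syllabify(["a", "m", "m", "a"]))
    print(syllabify(["s", "a", "n'", "s", "th", "aa"]))
```

The three calls reproduce the examples given with the rules: yaa + tra, am + ma, and san' + sthaa.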
In Hindi there are 5 vowels, 5 long vowels, two diphthongs, four semivowels and 33 consonants. Hindi has a one-to-one correspondence between its spoken and written forms. The phonemes are divided into two types, vowels (swaras) and consonants (vyanjanas). Together they constitute the alphabet set (varnamala). Vowels, also called swaras, are the letters that exist independently [10]. They are:
अ आ इ ई उ ऊ ऋ ए ऐ ओ औ
Consonants are letters that depend on vowels to take their independent form. They are shown below:
क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ल व श ष स ह
On this basis, the combination of a consonant and a vowel forms a syllable (C*V), also called kagunitha. Since a kagunitha is a combination of a consonant and a vowel, it belongs to the syllable group (C*V).
E.g. क + आ = का, i.e. C + V = (CV)
The Hindi language is syllabic in nature. Hence, building the speech database for a TTS system with the syllable as the basic unit is the better choice [4].
B. Concatenative Synthesis
Concatenative synthesis simply plays back the waveform with the matching phone string. An utterance is synthesized by concatenating several speech fragments. Unlike synthesis by rule, it requires neither rules nor manual tuning. Moreover, each segment is completely natural, so we should expect very natural output. However, speech segments are greatly affected by co-articulation, so if we concatenate two speech segments that were not adjacent to each other, there can be spectral or prosodic discontinuities. Spectral discontinuities occur when the formants at the concatenation point do not match. Prosodic discontinuities occur when the pitch at the concatenation point does not match. A listener rates synthetic speech that contains large discontinuities as poor, even if each segment is very natural. A number of factors contribute to the lack of naturalness in the output of speech synthesis systems, such as intonation and rhythm, variability along the prosodic parameters, and incorrect segmental rendering. The only task in this method is building an error-free speech database suitable for concatenation of speech units [1]. Prosody and intonation are also very important for natural-sounding speech.
In Hindi, words can be composed of basic characters as well as complex clusters of C*V*C. For the latter, rules are needed to break the word into syllables. The work cited here derives certain simple rules for syllabification, i.e. rules for grouping clusters of C*V*C, based on heuristic analysis of several words in the Telugu and Hindi languages [10]. A concatenation-based TTS system that takes the phoneme as its basic unit produces low-quality speech output because of the large number of concatenation points, which cause glitches. To avoid this, the syllable is taken as the basic unit of concatenation in this work.
This paper therefore describes how to build an error-free text and speech database for the Hindi language, as required to develop a concatenation-based TTS system.
II. STRUCTURE OF TEXT AND SPEECH DATABASE
During speech synthesis, the required syllable units are fetched from the speech database, concatenated and finally processed suitably to obtain quality speech output. Creating an error-free database of syllable units is therefore most important. The sound and duration of a syllable change slightly with its position of occurrence in the speech. A syllable can occur at three different positions [1]:
1. At the start of a word (Start).
2. In between two phonemes (Middle).
3. At the end of a word (End).
For this purpose, a text database of 1326 words covering the whole syllable (C*V) set is considered. It was prepared manually using a standard Hindi dictionary [12], text books and guidance from various researchers. From these sources, a text corpus of 1326 standard and non-standard unique words was prepared for building the speech database.
The text corpus shown in Table I covers the required syllable set in all possible positions of occurrence, i.e. Start, Middle and End. Many rarely occurring syllables such as ञ, यर, र,् ङ, छ् etc. are included as they are, so that all syllables are covered for documentation purposes.
For the speech database, PRAAT [8], a utility for the Windows operating system, is used. The prepared words were recorded with PRAAT at a sampling frequency of 16 kHz and a resolution of 16 bits [1].
The following example shows the process of building the speech database. Suppose the required syllable is बा. The three words बायत, आबाय, अमबा are recorded with PRAAT using a standard microphone and saved to the list. From each recorded word, बा is extracted, giving the syllable in all three possible positions. The extracted syllables are then stored in their respective directories according to their position of occurrence. Figures 5, 6 and 7 show the labeling process, and Figures 1 to 8 show the steps involved in using PRAAT during speech database building.
A. Procedure to build speech database
The steps below show how to use the PRAAT utility to build the speech database required for a concatenation-based TTS system.
Step1: Open PRAAT and select the 'Record mono Sound' option from the 'New' menu in the menu bar.
Fig. 1 PRAAT Tool
Step2: Select 16000 Hz sampling frequency and press record to start recording the required sound.
Fig. 2 Selecting sampling frequency and recording
Step3: Start recording and utter the word that covers the required unit. After recording, stop and save the sound to the list.
Fig. 3 Recording and save to list
Step4: Create a TextGrid and start labelling the speech waveform by selecting the 'View & Edit' option.
Fig. 5 'बा' Starting position
Fig. 6 'बा' Middle position
Fig. 7 'बा' End position
Step5: Extract the labeled sound files using the 'Extract all non-empty interval' option.
Fig. 8 Extract labeled speech unit
After all the labeled files have been extracted from the uttered sound, they are saved in their respective directories, as shown in Figure 9.
Fig. 9 Directories named Start, Middle and End
The 'बा' unit from the starting position is saved in the Start directory, the unit from the middle position in the Middle directory, and the unit from the end position in the End directory.
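The extract-and-store step can also be scripted outside PRAAT. The sketch below is only an illustration under stated assumptions: the recording is available as raw mono 16-bit PCM bytes at 16 kHz, the label boundaries are known in seconds, and the helper names, file names and library path are hypothetical. It uses only the Python standard library.

```python
import wave
from pathlib import Path

SAMPLE_RATE = 16000      # Hz, the sampling frequency used for the recordings
SAMPLE_WIDTH = 2         # bytes per sample (16-bit)
POSITIONS = ("Start", "Middle", "End")


def extract_interval(pcm: bytes, start_s: float, end_s: float) -> bytes:
    """Cut the labelled interval [start_s, end_s) out of mono 16-bit PCM data."""
    first = int(round(start_s * SAMPLE_RATE)) * SAMPLE_WIDTH
    last = int(round(end_s * SAMPLE_RATE)) * SAMPLE_WIDTH
    return pcm[first:last]


def save_unit(library: Path, position: str, name: str, pcm: bytes) -> Path:
    """Write one extracted unit as a WAV file into its position directory."""
    if position not in POSITIONS:
        raise ValueError(f"unknown position: {position!r}")
    folder = library / position
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.wav"
    with wave.open(str(path), "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(SAMPLE_WIDTH)
        out.setframerate(SAMPLE_RATE)
        out.writeframes(pcm)
    return path


if __name__ == "__main__":
    recording = bytes(SAMPLE_RATE * SAMPLE_WIDTH)        # one second of silence
    unit = extract_interval(recording, 0.25, 0.50)       # hypothetical label times
    print(save_unit(Path("SoundLibrary"), "Start", "unit_0001", unit))
```

Keeping one directory per position mirrors the Start, Middle and End layout of Figure 9, so a synthesizer can choose the variant of a syllable by position alone.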
The resulting speech database consists of a total of 1326 syllable (C*V) units. Each position has 429 syllables and 13 independent vowels, i.e. 442 units. Hence, over all three positions, a total of (429 × 3) + (13 × 3) = 1326 units of speech data is built.
TABLE I: TEXT CORPUS
FRONT MID BACK
कभर कीकय खटाक कायण खकाय खटाका ककयण चककत साकक कीकय पकीय धभकी कुमया तकुवा पऩ ॊकू कूकना ककून गुडाकू
क्रतक प्रक्रत क्र केयर याकेश तडके कैसा डकैती जाकै कोभर डकोटा भाको कोडी सकौय जाकौ कॊ कड सकॊ द कॊ क् क् क्
खकाय चखना देख खाकक भखान रेखा खखडकी भुखखमा साखख खीजना भखीय ऩयखी खुचय सखुर जाखु खून सखून जाखू ख्रऩा भाख्रत सख्र खेवा भखेर जाख ेखैयात सखैय राख ैखोवा जाखोय जाखो खौवा भुखौटा भाखौ खॊजय जखॊ सजखॊ ख् ख् ख् गगन गगन डग गात तगादा दगा गगयता फगगमा भागग गीदड दगीरा दागी गुजय झगुरी जागु गूथना फगूरा गागू ग्रह सग्रह जाग्र गेरी बॊगेडी जागे गैरयी दगैर जागै गोदात बगोडा जागो गौयव रगौय जागौ गॊदगी भगॊदा जागॊ ग् ग् ग् घटक फघय फघ घातक प्रघान साघा घघचपऩच सघघर याघघ घीना सॊघीम सघी घुटन सघुर भाघु घूभना सघून जाघू घ्रत सघ्रऩ जाघ्र घेयना भघेय दाघे घैरा भघैर सघै
घोखना सघोना भाघो घौद सघौय सघौ घॊट रघॊट घॊ घ् घ् घ् ङ ङ ङ ङा ङा ङा ङङ ङङ ङङ ङी ङी ङी ङु ङु ङु ङू ङू ङू ङ्र ङ्र ङ्र ङे ङे ङे ङै ङै ङै ङो ङो ङो ङौ ङौ ङौ ङॊ ङॊ ङॊ ङ् ङ् ङ् चक दचक ऩेच चाऩ ऩॊचाट ऩायचा गचकट ऩेगचया सगच चीरय ऩेचीदा प्रऩॊची चुका सचुक रचु चूक कचूय वाचू च्रभा सच्रभ वाच्र चटेा सचते याच ेचैतन्म सचैन चाचै चोकय कचोट चाचो चौवा कचौडी याचौ
चॊन्द्न्िका भचॊद भचॊ च् च् च्
छकाय ऩाछना ऩाछ छागर बफछाना ऩीछा घछकना बफघछमा छाघछ छीजन सछीन ऩॊछी छुवा बफछुवा वाछु छूटना सछूत राछू छ्र छ्र छ्र
छेडना सछेन ऩीछे छैत सछैत वाछै छोयी बफछोह वाछो छौका बफचौना वाछौ छॊगा सछॊद साछॊ
छ् छ् छ् जकड ऩूजना पौज जागीय खजाना ऩूजा न्द्जगय ऩून्द्जत फान्द्ज जीतना सजीत ऩाजी जुवायी बफजुभ जाजु जूट बफजूका काजू ज्र ज्र ज्र जेठ सजेन जाजे जैपवक बफजैरा जाजै जोखखभी घजोय राजो जौहय वजौय राजौ जॊगदाय सजॊग रजॊ ज् ज् ज्
झकोरा झझक जाझ झाडना फझावू साझा खझप्ना सखझना साखझ झीखना सझीन साझी झुटाना सझुना भाझु झूट जाझूना भाझू झ्र झ्र झ्र
झरेना साझरे ऩाझ ेझैर सझैरा साझै झोरी सझोरा ताझो झौय कझौय साझौ झॊकाय सझॊक झॊ झ् झ् झ् ञ ञ ञ ञा ञा ञा गञ गञ गञ ञी ञी ञी ञु ञु ञु ञू ञू ञू ञ्र ञ्र ञ्र ञ े ञ े ञ ेञै ञै ञै ञो ञो ञो ञौ ञौ ञौ ञॊ ञॊ ञॊ ञ् ञ् ञ्
टकयाव ऩाटर ऩाट टाऩना पऩटाया ऩाटा
टटकट बफटटमा फाटट टीकाकाय सटीक ऩाटी टुकडा भटुक जाटु टूटना भटूक राटू ट्र ट्र ट्र
टेकना सटेरा जाटे टैक्सी सटैय जाटै टोकन सटोरा भाटो टौर सटौर याटौ टॊकाय टॊ जाटॊ ट् ट् ट्
ठकाय ऩाठक ऩाठ ठाकुय सठाऩ ऩाठा टठग्ना गटठमा ऩाटठ ठीकडा गठीरा ऩाठी ठुनका घनठुय कठुय ठूरा सठूय ऩाठू ठ्र ठ्र ठ्र
ठेकेढाय सठेक भाठे ठैभ भठैर जाठै
ठोकना सठोक साठो ठौय कठौय जाठौ ठॊडा ठॊ ठॊ ठ् ठ् ठ् डफर सडक अखड डाककमा बफडार अगडा ङडमो अङडभ छोङड डीजर सडीर खखचडी डुफकी सडुर जाडु डूफना सडूक झाडू ड्र ड्र ड्र
डमेयी भडरे साड ेडनैा भडरै जाडै डोभनी अडोस साडो डौर सडौर ताडौ डॊका फडॊग याडॊ ड् ड् ड्
ढकना गाढन साढ ढाना गढाना गढा टढरावी गटढमा गटढ ढीरना सढीर साढी ढुरना सढुर साढु
ढूह सढूह साढू ढ्र ढ्र ढ्र ढेय सढेय साढे ढैम गढैमा साढै ढोका भढोवा वाढो ढौयी ऩढौसी जाढौ ढॊगा ढॊ ढॊ ढ् ढ् ढ् ण ण ण णा णा णा खण खण खण णी णी णी णु णु णु णू णू णू ण्र ण्र ण्र णे णे णे णै णै णै णो णो णो णौ णौ णौ णॊ णॊ णॊ ण् ण् ण्
तकना भतरी उगचत तागना बफताना अॊधता घतजाया इघतका अघत तीखा त्रतीम इभयती तुकाॊत भातुर भातु तूफ़ान भातूक सातू त्रतीम सॊत्रप्त त्र तमेीस भातभे सात ेतैनात नतैभ वात ैतोड सतोर भातो तौरना अतौर आतौ तॊगी भतॊगी भतॊ त् त् अत्
थकना थुथना अकथ थाऩना भथानी साथा गथेटाय भगथत अगथ थीभ भथीभ साथी थुथना राथुय साथु थूकना थाथु भाथू थ्र थ्र थ्र थेर भाथेन साथे
थैरा भथैन साथै थोडा हथोड जाथो थौडा हथौडा भाथौ थॊडा भाथॊगी प्रीथॊ थ् थ् थ् दकाय फॊदय नाद दाता बफदाय बफदा टदखना फॊटदश भटद दीभी भदीय फॊदी दकुडा भदरु भद ुदधू फॊदकू जाद ूिड आित साि
देखना बफदेश सादे दैघनक वदैन सादै दोगरा भादोन सादो दौड फदौना वादौ दॊगा वदॊती दॊ द् द् द्
धगडा फॊधन अध धाना फॊधान वाधा गधक अगधक आगध धीभय फाॉधीत राधी धुक फॊधुता साधु धूऩ सधूय वाधू ध्र ध्र ध्र धेना अधेड साधे
धैमरवान अधैमर साधै धोखा सधोना आधो धौखना भधौना साधौ धॊधा वधॊती रयधॊ ध् ध् ध्
नकटा ऩनही अॊकन नाका ऩनाह अधाना घनकट सघनह यानी नीका ऩनीयी अॊजनी नुकीरा अनुजा अनु नूतन कनूत जानू न्रशॊस न्र न्र नेती जानेक अॊजाने नैघतक फनैरा जानै नोचना भनोज भानो नौकय कनौज भानौ
नॊगरा भानॊद भानॊ न् न् न् ऩकड तऩना आकॊ ऩ ऩाठ क्रऩार ु ऩाऩा पऩटायी कपऩर सीपऩ ऩीच सऩीठ छऩी ऩुकाय सऩुदर काऩु ऩूजना अऩूणर ऩाऩु प्र प्र प्र
ऩेखना सऩेया ताऩे ऩैतान ऩाऩैना साऩै ऩोटा सऩोरा ऩाऩो ऩौनी फऩौती साऩौ ऩॊककर सऩॊत सोऩॊ ऩ् ऩ् ऩ्
पटना आपत वप पाटक सपाना इजापा कपकय भाकपमा काकप पीका अपीभ भापी पुरका सपुर सापु पूटना सपूना सापू फ्रतोश नफ्रत फ्र पेनी सपेद रापे पैरना छपैर कापै पोडना सपोड भापो पौज सफ़ौर कापौ पॊ की पॊ पॊ प् प् प् फनाभ फफय अजफ फाहय आफादी गुडॊफा बफकना अॊबफका अॊबफ फीजी सफीर खयाफी फुकचा फफुजा साफु फूकना फफूर साफू ब्रॊगेश ब्र ब्र फेकस सफेये साफे फैठक भफैय काफै फोतर वफोत साफो फौछाय धफौय वाफौ फॊडर प्रफॊद शुफॊ फ् फ् फ्
बकोस बबक साब
बायत आबाय अमबा भबखायी भभबय आभब बीतय भाबीर छाबी बुकडी भाबुन आबु बूगोर बबूत बाब ुभ्रघत सुभ्रत भ्र बेदक सबेद राबे बैमा सबैद वाबै बोग आबोग वाबो बौचक खबौद वाबौ बॊजन भबॊज भाबॊ ब् ब् ब्
भकान गभक मतीभ भाधुयी आभाद भाभा भभचरी आभभश साभभ भीठा आभीन भाभी भुखौटा अभुख ऩाभु भूसा अभूर साभ ूम्रग अम्रत कम्र भेमय सभेत जाभे भैरा धभैर याभै भोटा आभोद साभो भौजा अभौर सभौ भॊजन आभॊत्रण भॊ भ् भ् भ्
मतीभ ऩामर भम माचक आमात भामा घमभान घम भाघम मीश्वय मी बाशामी मुग आमुध आमु मूनानी सामूर यामू य्र य्र य्र मेन भामेर सामे मै मै मै
मोगी आमोग भामो मौनती समौर घाडमो मॊबत्रक भाटॊक मॊ म् म् म् यकफा आयसी माय याकेट आयाजी भाया रयमाज ऩरयणत ऩरय यीछ ऩयीस ऩयी
रुकना ऩरुभा ऩारु रूऩा ऩरूर भरू यर यर यर
येखीम आयेख ऩये यैमत सयैर ऩायै योकड आयोऩ कयो यौजा भयौदा जायौ यॊग सयॊग भयॊ य् य् य्
रकडी भरफा पर राट भरारा रैरा
भरखना भभरक भाभर रीडय भरीदा म्रणारी रुकना सरुका भार ुरूभ आरूचा ऩल्रु ल्र ल्र ल्र
रेखन आरेख ऩहरे रैरा सरैभ जारै रोटन अरोक भारो रौकी अरौककक जारौ रॊऩट ऩरॊग सरॊ र् र् र्
वकीर अवभ मुव वाटटका आवाज यवा पवकट आपवरा छपव वीयाना सवीद यवी वुजा सवुय वावु वूपय येवूय कावू व्र आव्रत व्र
वेदना आवेग कयवे वैतार चवैमा भावै वोटय अवोक सावो वौभा वौ वौ वॊटक वॊ वॊ व् व् व्
शकुनी भशक आक्रोश शाकीम भशान शीशा भशकवा आभशक खुभश शीशभ भशीन शीशी शुदा अशुब आशु शूरना बत्रशूर ऩाशू श्रगार श्र श्र
शेखय भभशेर राश ेशैरा अशैक अक्शै शोभशत अशोक आशो शौहय कशौय भशौ शॊककत बत्रशॊकु शॊ श् श् श् ष ष ष षा षा षा पष पष पष षी षी षी षु षु षु षू षू षू ष्र ष्र ष्र षे षे षे षै षै षै षो षो षो षौ षौ षौ षॊ षॊ षॊ ष् ष् ष्
सकर ककसकी तीस साभभर कसाना बासा भसकट काभसभ शाभस सीखना ऩसीना ऩायसी सुहास जासुभ ऩासु सूऩय जासूय रास ूस्र स्र स्र
सेठानी कसेत बासे सैकडा ऩसैना जासै सोता ऩासोभ ऩासो सौगात भसौना हासौ सॊकट फसॊती रासॊ स् स् स्
हभाया सहया भह हात सुहास साहा टहन्दी भटहरा कटह हीयक सोहीर भाही हुवा गहुना साहु हूयना जाहूभ माहू ह्रतॊत्री ह्र ह्र हेकड भहेश कहे हैयान सहैगा राहै होटर कना ऩाहो
हौरा डहौना जाहौ हॊत हॊ साहॊ ह् ह् ह्
III. TEXT PROCESSING
Text processing is the primary step in building a Hindi TTS system. Once the orthographic text is available, it must be pre-processed before synthesis [4]. The main intention behind text processing is to resolve any ambiguity between two characters. Every language has a corresponding Unicode range developed by language research centers, and every character has its own identification code. These identification codes are used in the pre-processing program to distinguish characters reliably and to resolve confusion between two characters [1]. The pre-processing program can be written in MATLAB, Java and many other programming languages; here it is implemented using .NET.
TABLE II COMPARISON OF DATABASE
Sl. No   Developed by                              Unit       Language                          Corpus
1        SJ College of Engineering, Mysore [1]     Phoneme    Kannada                           1605
2        Utkal University [11]                     Syllable   Hindi, Odiya, Bengali & Telugu    9317
3        Punjabi University [10]                   Syllable   Punjabi                           3312
4        Carnegie Mellon University [9]            Syllable   Hindi                             2344
5        RIT, Maharashtra [13]                     Phoneme    Konkani                           3000
To resolve the ambiguities in interpreting the Hindi alphabet, consonants and vowels are grouped into different classes and programmed accordingly [1]. The classification of vowels and consonants is shown below.
TABLE III. CONSONANT
Alphabet   Unicode   Decimal Equivalent
क          0915      2325
ख          0916      2326
ग          0917      2327
घ          0918      2328
ङ          0919      2329
TABLE IX. INDEPENDENT VOWEL
Alphabet   Unicode   Decimal Equivalent
अ          0905      2309
आ          0906      2310
इ          0907      2311
ई          0908      2312
उ          0909      2313
ऊ          090A      2314
ऋ          090B      2315
ए          090F      2319
ऐ          0910      2320
ओ          0913      2323
औ          0914      2324
Similarly, all other consonants are considered and grouped in Tables IV, V, VI, VII and VIII. The dependent vowel signs, which help to form syllables, are grouped afterwards.
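The Unicode and decimal columns of these tables are two notations for the same code point. As a small illustrative sketch (the function name is an assumption, not part of the paper), Python's built-in `ord` gives the decimal value and a format specifier gives the four-digit hexadecimal form:

```python
def code_point_row(ch: str) -> tuple[str, str, int]:
    """Return (character, 4-digit hexadecimal code point, decimal equivalent)."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    return ch, f"{ord(ch):04X}", ord(ch)


if __name__ == "__main__":
    # U+0915..U+0919 are the first five Devanagari consonants (Table III).
    for cp in range(0x0915, 0x091A):
        print(*code_point_row(chr(cp)))
```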
TABLE X: DEPENDENT VOWEL SIGN
Sign   Unicode   Decimal Equivalent
ा      093E      2366
ि      093F      2367
ी      0940      2368
ु      0941      2369
ू      0942      2370
ृ      0943      2371
े      0947      2375
ै      0948      2376
ो      094B      2379
ौ      094C      2380
्      094D      2381
TABLE XI: PADDING
Sign   Unicode   Digits Padded
ा      093E      01
ि      093F      02
ी      0940      03
ु      0941      04
ू      0942      05
ृ      0943      06
े      0947      07
ै      0948      08
ो      094B      09
ौ      094C      10
्      094D      ---
The pre-processor program reads the entered text character by character and generates a modified Unicode sequence as output. This sequence is stored in a text file and imported directly into the MATLAB program for further processing.
A. Rules applied during Pre-processing
1. If the character belongs to the independent vowel group shown in Table IX, the decimal value of its Unicode code point is padded directly with two zeroes. E.g. if the character read is अ, its decimal Unicode value 2309 is padded with two zeroes, giving the modified Unicode 230900.
2. If the character read belongs to the consonant group shown in Table III, the next character is checked. If the next character belongs to the dependent vowel sign group, the consonant's decimal Unicode value is padded with the corresponding two-digit value from Table XI. E.g. if the character entered is रु, it is divided into र, whose decimal Unicode value is 2352, and ु, whose decimal Unicode value is 2369 and whose padding value from Table XI is 04. The modified Unicode value is therefore 235204.
3. If the character belongs to the consonant group shown in Table III and the next character also belongs to the consonant group, the Unicode value is left unchanged. E.g. if the character read is ण, whose decimal Unicode value is 2339, and the next character read is also a consonant, the value remains 2339.
4. If the entered word is अरुण, its modified Unicode output is 230900 235204 2339. The spaces between the values make it possible to distinguish the individual characters of the entered word.
5. If the entered sentence is अरुण कुमार, its modified Unicode output is 230900 235204 2339 101010 232504 235001 2352. The value 101010 acts as the space between two words and is used to separate words during sentence formation.
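The rules above can be sketched in a few lines. The paper's pre-processor is written in .NET; the Python version below is only an illustration of the same rules, with assumed names throughout. It assumes the ten vowel signs ा to ौ are padded 01 to 10 in Unicode order (consistent with the examples, where ा gives 01 and ु gives 04), treats U+0915 to U+0939 as the consonant range, and silently skips any other character such as the halant, which the rules do not cover.

```python
# Independent vowels of Table IX, given by code point.
INDEPENDENT_VOWELS = {chr(cp) for cp in (
    0x0905, 0x0906, 0x0907, 0x0908, 0x0909, 0x090A, 0x090B,
    0x090F, 0x0910, 0x0913, 0x0914)}

# Dependent vowel signs mapped to their two padding digits (01..10, assumed order).
VOWEL_SIGN_PADDING = {chr(cp): f"{n:02d}" for n, cp in enumerate(
    (0x093E, 0x093F, 0x0940, 0x0941, 0x0942,
     0x0943, 0x0947, 0x0948, 0x094B, 0x094C), start=1)}

WORD_SEPARATOR = "101010"   # stands for the space between two words


def is_consonant(ch: str) -> bool:
    """True for Devanagari consonants U+0915..U+0939 (assumed range)."""
    return "\u0915" <= ch <= "\u0939"


def preprocess(text: str) -> str:
    """Convert Hindi text to the space-separated modified Unicode sequence."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == " ":
            tokens.append(WORD_SEPARATOR)
        elif ch in INDEPENDENT_VOWELS:                 # rule 1: pad with two zeroes
            tokens.append(f"{ord(ch)}00")
        elif is_consonant(ch):
            following = text[i + 1] if i + 1 < len(text) else ""
            if following in VOWEL_SIGN_PADDING:        # rule 2: pad with sign digits
                tokens.append(f"{ord(ch)}{VOWEL_SIGN_PADDING[following]}")
                i += 1                                 # the vowel sign is consumed
            else:                                      # rule 3: value unchanged
                tokens.append(str(ord(ch)))
        # any other character is skipped (assumption, not covered by the rules)
        i += 1
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("\u0905\u0930\u0941\u0923"))      # the word from rule 4
```

Because every token is either four or six digits long and 101010 cannot be produced by a Devanagari code point in this range, the output can be split on spaces without ambiguity.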
IV. SPEECH SYNTHESIS
Speech synthesis and processing are implemented in MATLAB. Once the database is built, selecting an appropriate algorithm for the concatenation-based TTS system is very important. According to recent studies, the direct waveform concatenation algorithm is well suited to speech synthesis [8].
The MATLAB program uses the modified Unicode file generated by the pre-processing program. It reads the file number by number and fetches the appropriate phonemes and syllables from the database. The spaces in the modified Unicode file are used to determine the directory from which each syllable should be fetched, i.e. Start, Middle or End. As an example, consider the word हमारा. Its syllable units are fetched separately from the respective directories of the database and concatenated using a suitable algorithm. Fig. 11 shows the concatenated speech output.
Fig. 11 Concatenated output
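The fetch-and-concatenate step can be sketched as follows. The paper implements it in MATLAB; this Python sketch is illustrative only. It assumes the first unit of a word comes from Start, the last from End and all others from Middle, that a one-unit word uses Start, and that the database is an in-memory dictionary standing in for the three directories. All names are hypothetical.

```python
WORD_SEPARATOR = "101010"   # word boundary marker in the modified Unicode file


def assign_positions(units):
    """Pair each unit of one word with the directory it should be fetched from."""
    last = len(units) - 1
    plan = []
    for i, unit in enumerate(units):
        if i == 0:
            position = "Start"          # also used for one-unit words (assumption)
        elif i == last:
            position = "End"
        else:
            position = "Middle"
        plan.append((unit, position))
    return plan


def plan_units(modified_unicode: str):
    """Split the modified Unicode sequence into words and label every unit."""
    words, current = [], []
    for token in modified_unicode.split():
        if token == WORD_SEPARATOR:
            if current:
                words.append(current)
                current = []
        else:
            current.append(token)
    if current:
        words.append(current)
    return [assign_positions(word) for word in words]


def concatenate(library, plan):
    """Direct waveform concatenation: join the sample lists end to end."""
    samples = []
    for word in plan:
        for unit, position in word:
            samples.extend(library[position][unit])
    return samples


if __name__ == "__main__":
    plan = plan_units("230900 235204 2339")
    toy_library = {                      # tiny fake waveforms, for illustration only
        "Start": {"230900": [0.1, 0.2]},
        "Middle": {"235204": [0.3]},
        "End": {"2339": [0.4, 0.5]},
    }
    print(plan)
    print(concatenate(toy_library, plan))
```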
After concatenation, the output is processed further with a moving-average window to smooth the concatenated signal. This increases the quality of the speech output.
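A moving-average smoother of this kind can be sketched as below. The window length is not stated in the paper, so the default of 5 samples is an assumption, as is the choice of a centred window that shrinks at the signal edges.

```python
def moving_average(samples, window=5):
    """Smooth `samples` with a centred moving average of odd length `window`."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)                    # window shrinks at the edges
        hi = min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed


if __name__ == "__main__":
    # A sudden jump, such as a mismatch at a concatenation point, is spread out.
    print(moving_average([0, 0, 3, 0, 0], window=3))
```

A longer window removes more of the glitch at a join but also blurs the speech itself, so in practice the length would have to be tuned by listening.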
V. CONCLUSION
This paper discusses the design and development of a Hindi text and speech database for a concatenation-based TTS system that takes the syllable as its basic unit. The technique provides high-quality speech output that is reasonably natural and close to the voice of the original speaker. The proposed approach minimizes the co-articulation effect and the prosody mismatch between adjacent concatenated units. Considering the position of the syllable during database building helps to reduce glitches during concatenation, maintains continuity in the concatenated speech, and improves the quality of the speech output compared with concatenation that ignores the position and duration of the unit.
REFERENCES
[1]. Ravi D J and Sudarshan Patilkulkarni (2011), “A Novel Approach to Develop Speech Database for Kannada Text-to-Speech System”, Int. J. on Recent Trends in Engineering & Technology, Vol. 05, No. 01.
[2]. Marian Macchi (1993), “Issues in Text-to-Speech Synthesis”.
[3]. Kishore S P and Black A (2003), “Unit Size in Unit Selection Speech Synthesis”, in Proceedings of Eurospeech, September, pp. 1317-1320.
[4]. Paul Taylor (2009), “Text-to-Speech Synthesis”, Cambridge University Press.
[5]. Lemmety S (1999), “Review of Speech Synthesis Technology”, M.S. Thesis, Dept. Elec. and Comm. Engg., Helsinki University of Technology.
[6]. Thomas S (2007), “Natural Sounding Text-to-Speech Synthesis Based on Syllable Like Units”, M.S. Thesis, Indian Institute of Madras.
[7]. Arun Kumar C and Shreekanth T (2014), “A Comprehensive Review on Concatenation Based Text to Speech Synthesis for Indian Language”, Int. J. Elec&Electr.Eng&Telecoms, Vol. 3, No. 2, April 2014, ISSN 2319 – 2518.
[8]. Boersma and Weenink, “PRAAT: A tool for phonetic analysis and sound manipulations”, 1992-2001. www.praat.org
[9]. S P Kishore and Alan W Black, “Unit Size in Unit Selection Speech Synthesis”, EUROSPEECH 2003, Geneva.
[10]. Parminder Singh and Gurpreet Singh Lehal (2006), “Text-To-Speech Synthesis System for Punjabi Language”, in Proceedings of International Conference on Multidisciplinary Information Sciences and Technologies, Merida, Spain.
[11]. Sanghamitra Mohanty (2011), “Syllable Based Indian Language Text To Speech System”, International Journal of Advances in Engineering & Technology, Vol. 1, Issue 2.
[12]. Badri Nath Kapoor, “Practical Hindi-English Dictionary”, January 1, 2004.
[13]. Pukhraj P. Shrishrimal, Ratnadeep R. Deshmukh and Vishal B. Waghmare, “Indian Language Speech Database: A Review”, International Journal of Computer Applications (0975 – 888), Volume 47, No. 5, June 2012.