A Framework for Bangla Text to Speech Synthesis

Authors

K. M. Azharul Hasan, Muhammad Hozaifa, Sanjoy Dutta, Rafsan Zani Rabbi

Presented By

Sanjoy Dutta

Department of Computer Science & Engineering

Khulna University of Engineering and Technology, Khulna, Bangladesh.

Contents

• Problem Statement

• Factors for Speech Synthesis in Bangla

• Proposed Framework

• Rules and Structure Development

• Syllable Parser Development

• Audio File Selection and Normalization

• Experimental Analysis & Results

• Conclusion

Problem Statement

• Develop a framework for Bangla text-to-speech synthesis.

Factors for Speech Synthesis in Bangla

• Sequential flow of diphones

A diphone is a pair of adjacent phonemes in which the transition between the two phonemes is modelled, usually from the middle of the first phoneme to the middle of the second.

A phoneme is a sound, or a group of different sounds, perceived to have the same function by speakers of the language or dialect in question. For example, the English /k/ phoneme occurs in "skill" and "school".
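
As a minimal illustration of the diphone definition above, the sketch below pairs each phoneme in a sequence with its successor. The function name `to_diphones` and the romanized phoneme labels are assumptions made for illustration and are not taken from the presented framework.

```python
def to_diphones(phonemes):
    """Return the adjacent phoneme pairs (diphones) of a phoneme sequence."""
    # zip pairs every phoneme with the one that follows it
    return list(zip(phonemes, phonemes[1:]))


# Illustrative romanized phonemes for the English word "skill"
print(to_diphones(["s", "k", "i", "l"]))
```

A sequence of n phonemes yields n - 1 diphones, which is why the flow of diphones follows the order of phonemes in the word.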

• Position vs. Pronunciation

Consonants and vowels occur in three kinds of position:

Consonant Vowel (CV)

Vowel Consonant (VC)

Vowel Consonant Vowel (VCV)
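
The three position patterns can be sketched as a simple classifier over a phoneme sequence. This is only an illustration: the romanized vowel set `VOWELS` and the function name `position_pattern` are hypothetical stand-ins, since the actual Bangla vowel inventory used by the framework is not reproduced here.

```python
# Hypothetical romanized vowel set, used only for this illustration
VOWELS = {"a", "i", "u", "e", "o"}


def position_pattern(phonemes):
    """Map each phoneme to 'V' (vowel) or 'C' (consonant) and join the result."""
    return "".join("V" if p in VOWELS else "C" for p in phonemes)


print(position_pattern(["k", "a"]))       # consonant followed by vowel
print(position_pattern(["a", "k"]))       # vowel followed by consonant
print(position_pattern(["a", "k", "a"]))  # consonant between two vowels
```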

Proposed Framework Structure and Rules

• Text Normalization:

Transforming text into a single standard form.

Applied during text-to-speech conversion to handle numbers, dates, acronyms, and abbreviations.

Text Normalization for Position vs. Pronunciation.
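
A minimal sketch of the general idea of text normalization follows, assuming a token-by-token lookup. It is not the rule set of the presented framework: the table `ABBREVIATIONS`, its single entry, and the digit-by-digit reading are assumptions chosen for illustration (real systems read numbers such as years as whole values).

```python
# Bangla digits mapped to their spoken names
DIGIT_NAMES = {
    "০": "শূন্য", "১": "এক", "২": "দুই", "৩": "তিন", "৪": "চার",
    "৫": "পাঁচ", "৬": "ছয়", "৭": "সাত", "৮": "আট", "৯": "নয়",
}

# Hypothetical abbreviation table; real entries would come from the rule set
ABBREVIATIONS = {"ডা.": "ডাক্তার"}


def normalize(text):
    """Expand known abbreviations and spell out digit-only tokens."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif all(ch in DIGIT_NAMES for ch in token):
            # Simplification: digits are read one by one
            out.extend(DIGIT_NAMES[ch] for ch in token)
        else:
            out.append(token)
    return " ".join(out)


print(normalize("ডা. ১২"))
```

Tokens that match neither table pass through unchanged, so the function can be applied safely to ordinary words.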

Normalization rules for ‘ ’

Normalization rules for ‘ - - -’

Syllable Parser Development

Syllable Parser In Action

Audio File Selection and Normalization

Bangla has 39 consonants and 11 vowels in total.

After reduction:

28 independent consonants

8 vowels (the vowel ‘ ’ is the exception)

Finally, 224 (28 × 8) audio files are needed for the syllables.

The 28 consonants are paired with 5 vowels to generate 140 (28 × 5) diphones.

In summary, 401 audio files need to be created: 9 vowels, 28 consonants, 224 syllables, and 140 diphones.
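
The file count above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones stated on the slide.

```python
CONSONANTS = 28          # independent consonants after reduction
SYLLABLE_VOWELS = 8      # vowels combined with consonants into syllables
DIPHONE_VOWELS = 5       # vowels combined with consonants into diphones
STANDALONE_VOWELS = 9    # vowel recordings, including the exception vowel

syllables = CONSONANTS * SYLLABLE_VOWELS
diphones = CONSONANTS * DIPHONE_VOWELS
total = STANDALONE_VOWELS + CONSONANTS + syllables + diphones

print(syllables, diphones, total)  # 224 140 401
```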

Experimental Analysis and Results

Strategy of Analysis:

Sample Input Test: Various News Articles from News Portals

Listener Selection: Anonymous persons chosen randomly

Accuracy Analysis:

\[
\text{Accuracy} = \frac{\text{Words listeners were able to hear clearly on the 1st attempt} \times 100}{\text{Total number of words in every sample}}
\]
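
The accuracy ratio above can be computed as in the sketch below. The function name `accuracy` and the sample counts in the usage line are illustrative assumptions and are not results from the experiment.

```python
def accuracy(clear_words, total_words):
    """Percentage of words heard clearly on the first attempt."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return clear_words * 100 / total_words


# Illustrative counts only: 45 of 50 words heard clearly
print(accuracy(45, 50))  # 90.0
```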

Experiment Result

Listening Factors:

• Duration synchronization and merging

• Numerical values such as years

Constraints in Sample 1:

‌ , , ,

, , ,

Constraints in Sample 2:

, , , , ,

,

Limitations and Future Works

Detect noun and adjective words, namely

( ) (noun) and

( ) (adjective).

Both words should follow rule 3(a), but they do not, and their pronunciations differ.

Conclusion

We believe the proposed framework can be useful for Bangla TTS development, as it detects Bangla words with a minimal audio file requirement.

Thank You!
