diphone set cover for speech synthesis - a java implementation

DIPHONE SET COVER FOR SPEECH SYNTHESIS – A JAVA IMPLEMENTATION

Stephen Hogan (50217631), Postgraduate Diploma in IT (Evening)

Dublin City University School of [email protected]

Abstract The primary purpose of this paper is to present a programmatic solution that minimises the recording of a speech corpus by defining a minimum set of words that completely covers all unstressed diphones required to synthesise every word in the CMU Pronouncing Dictionary.

The aim is to design text for recording read-speech corpora for concatenative text-to-speech systems. Speech corpus design is one of the key issues in building high-quality text-to-speech synthesis systems. Read speech is often used, since it seems to be the easiest way to obtain a recorded speech corpus with the highest control over the content.

We examine the application of the “Greedy Algorithm” to text selection by proposing a Java implementation. While much interest in speech synthesis is aimed at enhancing the naturalness and perceptual quality of synthesised speech, we are more concerned with achieving minimal complexity while maintaining minimal computational overhead.

This paper discusses the implementation with reference to the CMU Dictionary (footnote 1) symbol system for phonetic representation, which offers 39 phonemes and over 125,000 words in a text-file dictionary.

1. INTRODUCTION

Speech synthesis is the process of converting written text into machine-generated speech. Concatenative speech synthesis is based on the idea of concatenating pre-recorded speech units to construct the utterance. Concatenative systems tend to sound more natural than other systems, such as articulatory and formant synthesisers [8], since original speech recordings are used instead of models and parameters. In concatenative systems, the basic units of speech may be either variable-length units such as syllables and phones (an approach also known as unit selection) or fixed-size diphones. This paper considers the latter in order to arrive at a small covering set of utterances.

Diphones are a preferred and widely used unit for defining the foundational elements of synthesis. A diphone begins at the central point of the steady-state part of one phone and ends at the central point of the subsequent phone, and so contains the transition between the two phones [4]. The rationale for using a diphone is that the centre of a phonetic realisation is the most stable region, whereas the transition from one segment to another contains the most interesting phenomena and is thus the hardest to model. The diphone, then, cuts the units at the points of relative stability, rather than at the volatile phone-phone transition, where so-called coarticulatory effects appear [9]. (This makes the incorrect but practical and simplifying assumption that coarticulatory effects never extend over more than two phones [10].)

The main aim of framing the task as a Set Cover Problem, and more specifically of applying the Greedy Algorithm, is to select words so that every diphone occurring in the CMU Pronouncing Dictionary is covered by as few words as possible, while minimising redundant data (repetition of diphones).

In this paper, we first define the concept of coverage via the concept of “unit” and then describe the Java implementation for optimising coverage.

2. DIPHONE SET GENERATION

Many researchers use the greedy algorithm for speech corpus design [2, 3, 4]. The best-known algorithm is the greedy algorithm as applied to the set cover problem [4]. It is a simple iterative technique for constructing a subset of diphones from a large set of dictionary words and phonemes, so as to cover the largest diphone-unit space with the smallest number of dictionary words. Prior to the selection process, the target space to be covered needs to be defined by the unit definition, namely the feature space of a unit, which here is a diphone.

1 The Carnegie Mellon University Pronouncing Dictionary http://www.speech.cs.cmu.edu/cgi-bin/cmudict

2.1 Preparation of the Unit Definition

CMU provides a list of 39 phonemes, not counting variants for lexical stress, as shown in Table 1:

AA AE AH AO AW AY B CH D DH EH ER EY

F G HH IH IY JH K L M N NG OW OY

P R S SH T TH UH UW V W Y Z ZH

Table 1: A table displaying a list of the phonemes employed in the CMU Pronouncing Dictionary.

We need to combine these phonemes, along with silence, denoted by ‘_’ (sans quotes), into a set of diphones. With 39 phonemes plus silence there are 40 symbols, so the set of all ordered pairs has cardinality 40 × 40 = 1,600.
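The enumeration can be sketched in a few lines of Java. This is only an illustration and not the paper's own source: the class name `DiphoneInventory`, the method `allDiphones` and the "-" separator between the two phones are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DiphoneInventory {
    // The 39 CMU phonemes listed in Table 1.
    static final List<String> PHONEMES = List.of(
        "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER", "EY",
        "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG", "OW", "OY",
        "P", "R", "S", "SH", "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH");

    // Silence symbol, as denoted in the text.
    static final String SILENCE = "_";

    // Builds every ordered pair over the 40 symbols (39 phonemes plus silence).
    // The "-" separator is an assumption of this sketch.
    static Set<String> allDiphones() {
        List<String> symbols = new ArrayList<>(PHONEMES);
        symbols.add(SILENCE);
        Set<String> diphones = new LinkedHashSet<>();
        for (String first : symbols) {
            for (String second : symbols) {
                diphones.add(first + "-" + second);
            }
        }
        return diphones;
    }

    public static void main(String[] args) {
        System.out.println(allDiphones().size()); // prints 1600
    }
}
```

Using a set rather than a list means the count reflects distinct diphones only, which is the cardinality quoted above.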

An example of diphones is presented in Figure 1. Assuming some amount of on-glide and off-glide, the diphones of the word 'then' could be as follows [1]:

Letters: th | e | n (each phone is cut at its centrepoint)

Diphones: _DH, DH-EH, EH-N, N_

Figure 1: An alphabetic and phonetic representation of the diphones for the word ‘then’.

Ideally, though, we would like to reduce this diphone set to the minimum required to cover the unit space, i.e. all of the diphones contained within the dictionary.

2.2 Preparation of the Unit Space

As the dictionary in use contains a list of words and, for each word, the phonemes required to pronounce it, it is first necessary to generate a list of all the diphones associated with pronouncing each word.
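As a minimal sketch of this step, the following Java turns one word's phoneme sequence into its diphone list. The class and method names are hypothetical, and two assumptions are made: each word is padded with silence ("_") at both ends, as in Figure 1, and the lexical stress digits that the dictionary attaches to vowels (for example the 1 in EH1) are stripped, since only unstressed diphones are of interest.

```java
import java.util.ArrayList;
import java.util.List;

class DiphoneSplitter {
    // Converts a phoneme sequence into its diphones, padding both ends with silence.
    static List<String> toDiphones(List<String> phonemes) {
        List<String> padded = new ArrayList<>();
        padded.add("_");
        for (String p : phonemes) {
            padded.add(p.replaceAll("[0-9]", "")); // drop stress digits such as the 1 in "EH1"
        }
        padded.add("_");

        // Each adjacent pair of symbols forms one diphone.
        List<String> diphones = new ArrayList<>();
        for (int i = 0; i + 1 < padded.size(); i++) {
            diphones.add(padded.get(i) + "-" + padded.get(i + 1));
        }
        return diphones;
    }

    public static void main(String[] args) {
        // 'then' is DH EH1 N in CMU notation.
        System.out.println(toDiphones(List.of("DH", "EH1", "N")));
        // prints [_-DH, DH-EH, EH-N, N-_]
    }
}
```

A word of n phonemes therefore yields n + 1 diphones under this padding assumption.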

If all relevant phonetic realisations can be enumerated, then by simply collecting all phone-phone transitions, any possible sequence of speech sounds in the English language could be produced. Thus, with a 40-phone inventory (including silence), one could collect a 40 * 40 = 1600 diphone inventory and create a synthesizer that could speak anything (given the imposition of appropriate prosody, intonation, duration, and shift in spectral quality, as determined by other modules in a general-purpose synthesizer) [9]. When the diphones of all words are enumerated into a set, repeated diphones are discarded and diphones not yet encountered are added; we therefore end up with a set of all diphones that cover all words in the dictionary.

However, in natural human languages there are phonotactic constraints: some phone-phone pairs, even whole classes of phone-phone combinations, may not occur at all. Humans can often generate those so-called non-existent diphones if they try, and one must always think about phone pairs that cross word boundaries as well, but even then certain combinations cannot exist; for example, /hh/ /ng/ in English is probably impossible. /ng/ may really only appear after the vowel in a syllable (in coda position); however, in other languages it can appear in syllable-initial position. /hh/ cannot appear at the end of a syllable, though sometimes it may be pronounced when trying to add aspiration to open vowels [10].


It is therefore reasonable to use the unit space to determine which of the diphones defined in Section 2.1 never occur, thereby reducing the number of diphones required in our unit definition.

3. ALGORITHM

Consider a set of words, and a parallel set that contains for each word a list of diphones occurring for the corresponding word. How do we select a small set of words so that their corresponding diphones contain each diphone in the larger list at least once?

This problem has a well-known approximate solution in the form of the greedy algorithm. This algorithm successively selects words so that the first word is the word with the largest diphone type count; all diphones occurring in that word are removed from the larger list. Once N words have been selected, the next word selected is the word with the largest count of the remaining diphones [12].
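The selection rule just described can be sketched as follows. This is a textbook greedy set cover, written for illustration under the assumption that each word is already paired with its set of diphones; it is not taken from the application's source, and the names `GreedyCover` and `cover` are hypothetical. Ties are broken by insertion order, which is a choice of this sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GreedyCover {
    // Repeatedly picks the word that covers the most still-uncovered diphones.
    static List<String> cover(Map<String, Set<String>> wordDiphones) {
        // The target is the union of all diphones occurring in any word.
        Set<String> uncovered = new HashSet<>();
        for (Set<String> diphones : wordDiphones.values()) {
            uncovered.addAll(diphones);
        }

        List<String> chosen = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (Map.Entry<String, Set<String>> e : wordDiphones.entrySet()) {
                int gain = 0;
                for (String d : e.getValue()) {
                    if (uncovered.contains(d)) gain++;
                }
                if (gain > bestGain) { // strict comparison keeps the earliest word on ties
                    bestGain = gain;
                    best = e.getKey();
                }
            }
            chosen.add(best);
            uncovered.removeAll(wordDiphones.get(best));
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> words = new LinkedHashMap<>();
        words.put("then", Set.of("_-DH", "DH-EH", "EH-N", "N-_"));
        words.put("the", Set.of("_-DH", "DH-AH", "AH-_"));
        words.put("den", Set.of("_-D", "D-EH", "EH-N", "N-_"));
        System.out.println(cover(words));
    }
}
```

Each round rescans every word, so the cost grows with the dictionary size times the number of selected words; that is the price of always recomputing the count of remaining diphones.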

4. APPLICATION

4.1 Text Pre-Processing

The algorithm is implemented in Java, where the main data-handling constructs are lists, sets and maps. All possible diphones are generated from the CMU phoneme list together with silence, giving a set of all possible diphones (even those that may never be used).

Then each line of the CMU Pronouncing Dictionary is analysed by means of string parsing; from the phonemes assigned to each word, a diphone list is created and added to a diphonetic dictionary set. The advantage of using a set in this case is that repeated diphones are ignored, while diphones not yet encountered are added. Once the dictionary has been read in full, we have a diphone dictionary set containing every unique diphone required to utter any word in the dictionary.
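A hedged sketch of this pre-processing pass is given below. It assumes the usual plain-text layout of the dictionary file, namely one entry per line with the word first and its phonemes after it, separated by whitespace, and comment lines beginning with ";;;". The class name `DictionaryReader` and the in-memory list of lines are illustrative stand-ins for the real file handling.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DictionaryReader {
    // Collects every distinct diphone found across the given dictionary lines.
    static Set<String> collectDiphones(List<String> lines) {
        Set<String> seen = new TreeSet<>(); // a set silently ignores repeated diphones
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith(";;;")) continue; // skip blanks and comments
            String[] tokens = trimmed.split("\\s+");
            if (tokens.length < 2) continue; // a word with no phonemes contributes nothing

            // tokens[0] is the word itself; the rest are its phonemes.
            String previous = "_"; // leading silence
            for (int i = 1; i < tokens.length; i++) {
                String phone = tokens[i].replaceAll("[0-9]", ""); // strip stress digits
                seen.add(previous + "-" + phone);
                previous = phone;
            }
            seen.add(previous + "-_"); // trailing silence
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            ";;; sample lines only, not the real dictionary",
            "THEN  DH EH1 N",
            "DEN  D EH1 N");
        System.out.println(collectDiphones(sample));
    }
}
```

The two sample words share EH-N and N-_, so the resulting set holds six diphones rather than eight.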

With this diphonetic dictionary set, it is possible to identify the diphones never encountered and remove them in the following manner:

Let diphoneSet = {all possible diphones from CMU}.

Let diphoneticDictionarySet = {all diphones generated and extracted from the CMU Pronouncing Dictionary}, where diphoneticDictionarySet ⊆ diphoneSet.

Then diphoneSet \ diphoneticDictionarySet = {unused diphones}.

diphoneSet \ {unused diphones} = {all diphones necessary to cover all diphones associated with the words in the CMU Pronouncing Dictionary}.
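In Java, these set differences map directly onto `Set.removeAll`. The sketch below uses three-element toy sets in place of the real ones; the helper `difference` and the class name are assumptions of the example.

```java
import java.util.HashSet;
import java.util.Set;

class UnusedDiphones {
    // Returns a \ b without modifying either argument.
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Toy stand-ins for the two sets in the text; the real sets are far larger.
        Set<String> diphoneSet = Set.of("DH-EH", "EH-N", "HH-NG");
        Set<String> diphoneticDictionarySet = Set.of("DH-EH", "EH-N");

        Set<String> unused = difference(diphoneSet, diphoneticDictionarySet);
        Set<String> needed = difference(diphoneSet, unused);

        System.out.println(unused);                                  // prints [HH-NG]
        System.out.println(needed.equals(diphoneticDictionarySet));  // prints true
    }
}
```

Because diphoneticDictionarySet is a subset of diphoneSet, the second difference simply recovers diphoneticDictionarySet, which the last line confirms.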

Then we establish a list of words in which no diphone is repeated within the word. This guarantees that each word’s associated diphone list contains only a single instance of each diphone. From this list, we count the number of diphones per word.
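One simple way to test for repeats, sketched here under the assumption that each word's diphones are held in a list, is to compare the list's length with the size of the equivalent set. The class name and the example phoneme sequence D OW D OW are hypothetical.

```java
import java.util.HashSet;
import java.util.List;

class RepeatFilter {
    // True when every diphone in the word's list is distinct.
    static boolean hasNoRepeats(List<String> diphones) {
        return new HashSet<>(diphones).size() == diphones.size();
    }

    public static void main(String[] args) {
        // 'then': four distinct diphones, so the word is kept.
        System.out.println(hasNoRepeats(List.of("_-DH", "DH-EH", "EH-N", "N-_"))); // prints true
        // A hypothetical sequence D OW D OW repeats D-OW, so the word is dropped.
        System.out.println(hasNoRepeats(List.of("_-D", "D-OW", "OW-D", "D-OW", "OW-_"))); // prints false
    }
}
```

For words that pass this check, the diphone count used as the map key is just the length of the list.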

4.2 Data Structure

The data structure employed to contain these unique-diphone words is a MultiMap [11]: the key (index) is the number of diphones per word, and it is mapped to a list of multiple elements, each being a diphonetic entry consisting of the word, the diphones associated with that word, and the number of diphones, which corresponds to the map’s key.

Starting with the largest key and iterating backwards, each associated list is examined word by word. We thus begin with the longest unique-diphone words, subtracting their diphones from the diphone set: the larger the word, the more diphones it covers, and the fewer words need to be subtracted from the set. We remove each word from the map once its diphones have been successfully removed from the set, and iterate through all of the words in the map in this fashion until the diphone set is empty.
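The structure and the descending walk might look roughly like the sketch below. It is one reading of the description rather than the application's code: the multimap is modelled as a `TreeMap` from diphone count to a list of entries, a word is kept whenever it removes at least one remaining diphone, and the names `DiphoneMultiMap`, `Entry`, `put` and `select` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

class DiphoneMultiMap {
    // One diphonetic entry: the word and its repeat-free diphones.
    static final class Entry {
        final String word;
        final Set<String> diphones;
        Entry(String word, Set<String> diphones) {
            this.word = word;
            this.diphones = diphones;
        }
    }

    // Key: number of diphones in the word. Value: all entries with that count.
    private final TreeMap<Integer, List<Entry>> byCount = new TreeMap<>();

    void put(String word, Set<String> diphones) {
        byCount.computeIfAbsent(diphones.size(), k -> new ArrayList<>())
               .add(new Entry(word, diphones));
    }

    // Walks keys from largest to smallest, keeping each word that removes
    // at least one diphone from the remaining set.
    List<String> select(Set<String> target) {
        Set<String> remaining = new HashSet<>(target);
        List<String> selected = new ArrayList<>();
        for (List<Entry> bucket : byCount.descendingMap().values()) {
            for (Entry e : bucket) {
                if (remaining.isEmpty()) return selected;
                if (remaining.removeAll(e.diphones)) selected.add(e.word);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        DiphoneMultiMap map = new DiphoneMultiMap();
        map.put("tent", Set.of("_-T", "T-EH", "EH-N", "N-T", "T-_"));
        map.put("ten", Set.of("_-T", "T-EH", "EH-N", "N-_"));
        map.put("net", Set.of("_-N", "N-EH", "EH-T", "T-_"));

        Set<String> target = new HashSet<>();
        target.addAll(Set.of("_-T", "T-EH", "EH-N", "N-T", "T-_"));
        target.addAll(Set.of("N-_", "_-N", "N-EH", "EH-T"));
        System.out.println(map.select(target)); // prints [tent, ten, net]
    }
}
```

In this toy run "ten" is selected although only one of its four diphones (N-_) is still uncovered, which shows how a single pass ordered by word size can admit largely redundant words.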

5. RESULTS & DISCUSSION

The main interface of this system is shown in Figure 2. From a dictionary size of 133,720 words, the minimal word list to cover all diphones was reached at 924, and this minimal list is presented in a Java GUI format:

Figure 2: The Java GUI displaying the minimal word list.

Note that this method of capturing a minimal list of the words is sub-optimal, as we may select words from the map that contain a number of diphones that have already been subtracted from the set. While added to the minimal list of words being compiled, these words are in fact redundant.

It is also worth noting that, as diphones are removed from the diphone set, and given the constraint that the number of diphones per word must be less than the cardinality of the diphone set, the algorithm tends to slow down as that cardinality tends to zero.

6. CONCLUSIONS & RECOMMENDATIONS

This research has proposed a programmatic method for selecting a minimum number of words from the CMU Pronouncing Dictionary that cover all of the diphones that may be generated from the pronunciations in this dictionary.

As future work, a prosody generation module may be added to address intonation and duration modelling of the language. This is also outlined in [8].

Work has already been done on modifying the greedy selection method to achieve maximum variability of units when designing text for speech corpus construction, and to improve the content richness of the text corpus obtained by greedy selection [2].

With minor modification to the main Java application, other languages could be considered, provided that a similar phonetic dictionary and phoneme set are accessible as files.

7. REFERENCES

[1] Altwarg, R. Diphone Definition for Mandarin Speech Synthesis. Macquarie University, Speech and Language Processing. SLP807. Speech Synthesis. (http://www.shlrc.mq.edu.au/masters/students/raltwarg/di_home.htm)

[2] Bozkurt, B., Dutoit, T., and Ozturk, O. (2003) Text Design For TTS Speech Corpus Building Using A Modified Greedy Selection. Proc. EUROSPEECH, European Conference on Speech Communication and Technology, Geneva, pp. 277-280. (http://tcts.fpms.ac.be/publications/papers/2003/eurospeech03_bbootd.pdf)

[3] Chitturi, R., Mariam, S. H. and Kumar, R. (2005) Rapid Methods for Optimal Text Selection. Recent Advances in Natural Language Processing, Borovets, Bulgaria. (http://web.iiit.ac.in/~rahul_ch/optimal_specom.pdf)

[4] Cormen, T., Leiserson, C. and Rivest, R. (1990) Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts.

[5] Dutoit, T. (1997) An Introduction to Text-to-Speech Synthesis. Dordrecht/Boston/London: Kluwer Academic Publishers, vol. 3.

[6] François, H. and Boëffard, O. (2001) Design of an Optimal Continuous Speech Database for Text-To-Speech Synthesis Considered as a Set Covering Problem. Proc. of Eurospeech, Aalborg, Denmark. (http://www.irisa.fr/cordial/hfrancoi/publis/HF_OB_eurospeech01.ps)

[7] François, H. and Boëffard, O. (2002) The Greedy Algorithm and its Application to the Construction of a Continuous Speech Database. Proc. LREC 2002, pp. 1420-1426.

[8] Hasim, S., G. Tunga and S. Yasar. (2006) A Corpus-Based Concatenative Speech Synthesis System for Turkish. Turk. J. Elect. Eng. Comput. Sci., 14: 209-223. (http://www.cmpe.boun.edu.tr/~gungort/papers/A%20Corpus-Based%20Concatenative%20Speech%20Synthesis%20System%20for%20Turkish.pdf)

[9] Lenzo, K. and Black, A. (2000) Diphone Collection and Synthesis. International Conference on Spoken Language Processing, ICSLP 2000. Beijing, China, vol. III, pp. 306-309. (http://www.cs.cmu.edu/~awb/papers/ICSLP2000_diphone.pdf)

[10] Lenzo, K. and Black, A. (2007) Diphone Databases: Defining a Diphone List. FestVox: Building Synthetic Voices. Language Technologies Institute, Carnegie Mellon University. (http://festvox.org/bsv/c2261.html)

[11] Multimaps. The Map Interface – The Java Tutorials. (http://java.sun.com/docs/books/tutorial/collections/interfaces/map.html)

[12] van Santen, J. P. H. and Buchsbaum, A. L. (1997) Methods for Optimal Text Selection. Proc. of Eurospeech, pp. 553-556, Rhodes, Greece. (http://adambuchsbaum.com/papers/eurogreedy.pdf)

[13] van Santen, J. P. H., Sproat, R. W., Olive, J. P. and Hirschberg, J. (1996) Progress in Speech Synthesis. Springer.

APPENDIX 2 – SOURCE CODE

Diphone.java


DiphoneticDict.java


APPENDIX 3 – SCREENSHOTS OF THE JAVA IMPLEMENTATION

Clicking on the START button:

Scrolling down to the end:
