Diploma Thesis in Computer Science (Diplomarbeit im Fach Informatik)

Grapheme to Phoneme Conversion Using CRFs with Integrated Alignments

Rheinisch-Westfälische Technische Hochschule Aachen
Chair of Computer Science 6 (Lehrstuhl für Informatik 6)
Prof. Dr.-Ing. H. Ney

Submitted by: Andrei Guta
Student ID (Matrikelnummer): 252126
E-mail: [email protected]

Examiners: Prof. Dr.-Ing. H. Ney, Prof. Dr. B. Leibe
Supervisors: Dipl.-Phys. Patrick Lehnen, Dipl.-Inform. Stefan Hahn

May 17, 2011



Declaration (Erklärung)

I hereby declare that I have written this diploma thesis independently and have used no aids other than those indicated. All text excerpts and figures taken verbatim or in adapted form from published works are marked by references.

Aachen, May 17, 2011

Andrei Guta


Abstract

The application of Conditional Random Fields (CRFs) has achieved state-of-the-art results in many natural language processing tasks. In order to apply CRFs to Grapheme-to-Phoneme (G2P) conversion, an alignment has to be provided for the training of the model parameters. Computing this alignment with an external model in a preprocessing step introduces both error propagation and a dependency on an additional model. Thus, instead of computing a single best alignment, we consider all matching alignments with respect to certain constraints during the training of the model parameters.

Moreover, we further develop the CRF training software, which was previously used for tagging tasks, such that it becomes fully applicable to the G2P corpora we use for the training and the evaluation of the CRF model. This implies a decoupling of the target sequences from the alignment, which allows for separate alignment and target sequence features. An additional contribution of this thesis is the inclusion of 1-to-2 alignment links into the CRF software, which is necessary for the majority of the G2P corpora. Furthermore, we address the convexity problem, which arises from the integration of alignments as latent variables into our CRF model, and investigate various initializations of the model parameters.

We have evaluated the CRF software on two English G2P corpora: Celex and NETtalk. Firstly, the approach of CRFs with integrated alignments is shown to outperform the single-best-alignment approach. Secondly, combining the CRF software, including 1-to-2 alignment links, parameter initialization and decoupled features, with the integration of alignments into the CRF model training leads to new state-of-the-art results in G2P.


Acknowledgment

Firstly, I would like to thank Prof. Dr.-Ing. Hermann Ney for the interesting opportunities at the Chair of Computer Science 6 of the RWTH Aachen University of Technology within the scope of this diploma thesis, as well as during my whole time as a student assistant since May 2007.

Moreover, I would like to thank Prof. Dr. Bastian Leibe, who kindly accepted to co-supervise the work.

In particular, I would like to thank Patrick Lehnen and Stefan Hahn for the great supervision of my work, for sharing their knowledge, and for their constant support and harmonious assistance.

Special thanks go to my friend Alex Dale for proofreading the manuscript as well as for his helpful remarks concerning English grammar.

Finally, I would like to thank my girlfriend Manuela Isbaner, my parents Dipl.-Ing. Gabriela Guta and Dipl.-Ing. Dan Guta, and my grandparents Livia Guta, Ifigenia Oancea and Prof. Dr.-Ing. Nicolae Oancea for their tireless and cordial support, as well as for their efforts to create the conditions necessary for me to write this thesis and complete my studies successfully.


Contents

1 Introduction
  1.1 Motivation
  1.2 Related Work
    1.2.1 Joint-Sequence Models
    1.2.2 Online Discriminative Training
    1.2.3 Integrating Joint n-gram Features into a Discriminative Training Framework
    1.2.4 Letter-Phoneme Alignment: An Exploration
    1.2.5 Hidden Conditional Random Fields
    1.2.6 Hidden Dynamic Probabilistic Models for Labeling Sequence Data
  1.3 Scope of Work

2 Conditional Random Fields for G2P
  2.1 Grapheme-to-Phoneme Conversion
    2.1.1 Maximum Approach
    2.1.2 Summation Approach
  2.2 Linear Chain CRF
  2.3 Training
    2.3.1 Training Criterion, Algorithm and Regularization
    2.3.2 Gradient
    2.3.3 Forward-Backward Algorithm
    2.3.4 Runtime Complexity
  2.4 Search

3 Alignments
  3.1 Overview
  3.2 Alignment Constraints
    3.2.1 Monotonicity Constraint
    3.2.2 Representation Constraint
    3.2.3 Mapping Constraint
  3.3 M-to-1 Alignment
    3.3.1 Extended Grapheme Sequences
    3.3.2 Examples
    3.3.3 Search Space
  3.4 1-to-2 Alignment Links
  3.5 External Alignment
  3.6 Resegmentation
  3.7 Summation over Alignments
  3.8 Convexity
  3.9 Initialization

4 Features
  4.1 Tagging CRF Model
    4.1.1 Single Lexical Feature
    4.1.2 Combined Lexical Feature
    4.1.3 Transition Feature
    4.1.4 Decoupling of Alignments and Target Sequences
  4.2 CRF Model 1
    4.2.1 Single Lexical Feature
    4.2.2 Combined Lexical Feature
  4.3 CRF Model 2
    4.3.1 Alignment Feature
    4.3.2 Bigram Feature
  4.4 CRF Model 3
    4.4.1 Alignment Feature
    4.4.2 Bigram Feature
  4.5 CRF Model 4
    4.5.1 Bigram Feature
  4.6 CRF Model 5
    4.6.1 Single Lexical Feature
    4.6.2 Combined Lexical Feature
    4.6.3 Alignment Feature
    4.6.4 Bigram Feature
  4.7 Summary

5 Experiments and Results
  5.1 Performance Metrics
  5.2 Corpora
  5.3 Features
    5.3.1 Celex
    5.3.2 NETtalk 15k
  5.4 Alignments
  5.5 Initializations
  5.6 Doubled Grapheme Sequences
  5.7 Grapheme Sequence Extension using the Joint n-gram Model
  5.8 1-to-2 Alignment Links

6 Conclusion and Outlook

List of Figures

List of Tables

Bibliography


Chapter 1

Introduction

During the last years, CRFs have emerged as a state-of-the-art discriminative modelling framework for many Natural Language Processing (NLP) tasks, for instance concept tagging (see [Hahn & Lehnen+ 08] and [Hahn & Lehnen+ 09]) and part-of-speech (POS) tagging (see [Lafferty & McCallum+ 01]). In these domains, one does not have to face an alignment problem, because the alignment is usually annotated within the corpus. On the other hand, there are various applications, e.g. Statistical Machine Translation (SMT), Automatic Speech Recognition (ASR) and G2P conversion, in which the alignment is usually unknown. Generally, one makes use of an external model to compute the single best alignment in a preprocessing step.

Since this precomputed alignment introduces both a dependency on an external model and a possible error propagation, the general task of this work consists of integrating the alignment generation directly into the CRF training process. We have chosen the G2P conversion task as the application domain, since it allows for certain constraints regarding the alignments. In a nutshell, G2P conversion consists of generating the pronunciation of a given word, i.e. the most probable sequence of phonemes for a given sequence of graphemes.

1.1 Motivation

Due to the fact that CRFs achieve state-of-the-art results in many NLP applications with given alignment annotations, we focus on further improving them for those tasks in which the alignment is unknown. Because an alignment is needed for the training of the CRF model, the usual approach is to compute the most probable one in a preprocessing step before the actual training of the CRF model. This approach has two disadvantages: firstly, one has to make use of an additional external model, which entails additional steps of both training and tuning. Secondly, the precomputed alignment is optimized for the external model, not for the CRF model. Moreover, it implies an error propagation, since it is computed using a statistical model.

In order to get rid of both the additional training steps and the error propagation, we would like to integrate the alignment generation into the CRF model training. To perform the integration, we implement two different approaches. On the one hand,


we can begin with a linear segmentation, improve it iteratively and, once it converges, keep it fixed for the actual CRF model training. On the other hand, we can drop the idea of sticking to one fixed alignment and consider all possible alignments during the whole CRF training.

We have chosen G2P conversion as an application task for various reasons. Firstly, CRFs lead to state-of-the-art results for multiple NLP tasks, all of which share one common aspect, namely monotonic alignments. These permit a much smaller search space and are a significant feature of the G2P conversion task. The G2P conversion task is defined as follows: for a given grapheme sequence representing a written word, one has to generate the phoneme sequence which represents its most likely pronunciation. A main application of G2P conversion is text-to-speech (TTS) systems [Schroeter & Conkie+ 02]. Furthermore, G2P is of particular importance for spelling correction [Toutanova & Moore 02] and speech-to-speech translation systems, which have to infer the pronunciation of target language sequences, as shown in [Engelbrecht & Schultz 05]. In addition, G2P conversion plays a crucial role in the enrichment of pronunciation dictionaries in ASR, especially for named entities whose pronunciations are usually not contained in standard lexica.

“throwing” [Tr5IN] = th[T] r[r] ow[5] i[I] ng[N]

“proxy” [prQksI] = p[p] r[r] o[Q] x[ks] y[I]

Figure 1.1. Examples of grapheme-phoneme sequence pairs and possible alignments.
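The alignments in Figure 1.1 can be represented directly as lists of monotonic segment pairs. The following sketch (plain Python; the names are chosen for illustration) stores an alignment as a list of (grapheme segment, phoneme segment) links and checks that concatenating the links restores both sequences:

```python
# An alignment as a list of (grapheme segment, phoneme segment) links;
# concatenating the segments must restore the word and its pronunciation.
throwing = [("th", "T"), ("r", "r"), ("ow", "5"), ("i", "I"), ("ng", "N")]
proxy = [("p", "p"), ("r", "r"), ("o", "Q"), ("x", "ks"), ("y", "I")]

def is_valid_alignment(links, graphemes, phonemes):
    """Check that the links cover exactly the grapheme and the phoneme
    sequence, in order (monotonicity is implicit in the list order)."""
    return ("".join(g for g, p in links) == graphemes
            and "".join(p for g, p in links) == phonemes)
```

For example, `is_valid_alignment(throwing, "throwing", "Tr5IN")` holds, while dropping a phoneme makes the check fail.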

1.2 Related Work

Several works concerning the G2P conversion task have been published in recent years. In the following subsections, we present those most relevant to our approach.

1.2.1 Joint-Sequence Models

A main contribution to the field of G2P conversion is given by [Bisani & Ney 08], where joint-sequence models are applied. The fundamental idea is based on the concept of a graphone $q$, which is a pair of a grapheme sequence $f_q$ and a phoneme sequence $e_q$, $q = (f_q, e_q)$. Hence, a sequence of graphones generates for each word its orthographic form and its pronunciation. Usually there are multiple sequences of graphones that all represent the same pair of graphemes and phonemes, i.e. for each possible alignment


of a grapheme sequence and its corresponding phoneme sequence there is a unique graphone sequence. Since a graphone is a pair of sequences, multiple graphemes can be aligned to multiple phonemes, which implies a many-to-many (M-to-N) alignment. For instance, the word “mixing” and its pronunciation [mIksIN] have, among others, the alignments shown in Figure 1.2.

“mixing” [mIksIN] = m[m] i[I] x[ks] ing[IN]

“mixing” [mIksIN] = m[m] i[I] x[k] -[s] i[I] n[-] g[N]

Figure 1.2. Example of different possible alignments for a grapheme-phoneme sequence pair using the joint-sequence model.

The joint probability $p(f, e)$ of the grapheme sequence $f$ and the phoneme sequence $e$ is determined by summing over all matching alignments, i.e. over all graphone sequences $q_1^K \in \mathcal{A}(f, e)$, where $\mathcal{A}(f, e)$ is the set of all graphone sequences which yield $f$ and $e$, respectively, when their elements are concatenated. Thus, the joint probability $p(f, e)$ is reduced to the probability distribution $p(q_1^K)$ over matching graphone sequences $q_1^K$, which is modeled using an $M$-gram approximation:

$$p(f, e) = \sum_{q_1^K \in \mathcal{A}(f, e)} p(q_1^K) = \sum_{q_1^K \in \mathcal{A}(f, e)} \prod_{k=1}^{K+1} p(q_k \mid q_{k-1}, \ldots, q_{k-M+1})$$

The model parameters are estimated by maximizing the log-likelihood of the training data using the expectation maximization (EM) algorithm; their initialization is a uniform distribution over all graphones $q$ with $|f_q| \le L$ and $|e_q| \le L$ for a length constraint $L$. Moreover, smoothing and trimming are applied to avoid overfitting on the training data. Both the size of the $M$-gram and the upper limit $L$ are the external parameters of the joint-sequence model and determine the context size, i.e. the number of graphemes and phonemes which affect the estimated probabilities at a given position. Experiments on NETtalk 15k for L-to-L models show, on the one hand, that the higher the $M$-gram size and $L$ are, the better the model performs on the development corpus, and on the other hand, that the situation regarding $L$ is reversed for $M \ge 4$, i.e. in this case the lower $L$ is, the better the results are. Thus, with 4-grams (and beyond) one has to stick to singular graphones, i.e. $L = 1$, to obtain


the best results. In addition, for $M = 6$ it is shown that 1-to-L models perform better than L-to-L models on the NETtalk 15k corpus and that the best results are achieved when using 1-to-1 models. This is probably due to the fact that most words have fewer phonemes than graphemes. Although the summation over matching alignments has also been tested in the inference of the most probable phoneme sequence, it has not yielded any improvement.
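To make the summation over $\mathcal{A}(f, e)$ concrete, the following sketch enumerates all graphone segmentations of a grapheme-phoneme pair under a length constraint L and sums their probabilities under a unigram graphone model, i.e. M = 1. The recursion and the toy probability table are illustrative assumptions, not the implementation of [Bisani & Ney 08]:

```python
def segmentations(f, e, L):
    """Yield all graphone sequences covering (f, e), where each graphone
    consumes up to L graphemes and up to L phonemes (not both empty)."""
    if not f and not e:
        yield []
        return
    for i in range(min(L, len(f)) + 1):
        for j in range(min(L, len(e)) + 1):
            if i == 0 and j == 0:
                continue  # a graphone must consume at least one symbol
            for rest in segmentations(f[i:], e[j:], L):
                yield [(f[:i], e[:j])] + rest

def joint_probability(f, e, L, p_q):
    """p(f, e): sum over matching graphone sequences of the product of
    unigram graphone probabilities p_q (M = 1 for simplicity)."""
    total = 0.0
    for seq in segmentations(f, e, L):
        prob = 1.0
        for q in seq:
            prob *= p_q.get(q, 0.0)
        total += prob
    return total
```

With a toy table such as `p_q = {("a", "x"): 0.5, ("b", ""): 0.5}`, only the segmentations whose graphones all have nonzero probability contribute to the sum; every other alignment in $\mathcal{A}(f, e)$ adds zero.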

1.2.2 Online Discriminative Training

In [Jiampojamarn & Cherry+ 08] and [Jiampojamarn & Kondrak 09], the authors propose joint processing and discriminative training for G2P conversion. In order to train the model, the EM algorithm is used to train a mapping probability on the word-pronunciation pairs. Once this probability converges, the Viterbi algorithm is applied to compute the single best M-to-N alignment, limited to $M, N \le 2$, which includes epsilons on the source side, i.e. empty phonemes. At test time, a monotone phrasal decoder [Zens & Ney 04] is applied.

The scoring model $\alpha \cdot \Phi(f, e)$, where $\Phi(f, e)$ is a binary feature set and $\alpha$ a weight vector, contains various binary features: context features that model the dependency of a phoneme on the corresponding sequence of substrings, first-order transition features, and linear chain features, which are a combination of context and transition features. The feature weights are updated online using the Margin Infused Relaxed Algorithm (MIRA, see [Crammer & Singer 03]) based on the n-best outputs. Moreover, the best results on the Celex development set (see Section 5.2) are obtained for a source context size of 5.

1.2.3 Integrating Joint n-gram Features into a Discriminative Training Framework

[Jiampojamarn & Cherry+ 10] develop the online discriminative training presented above further. Before the actual training of the model parameters, the single best M-to-N alignment is computed in a preprocessing step using a variant of the EM algorithm described in [Jiampojamarn & Kondrak+ 07]. The model parameters are trained using the online discriminative training presented in [Jiampojamarn & Cherry+ 08] and are updated using MIRA.

In addition to the features already described in [Jiampojamarn & Kondrak 09], the system contains joint n-gram features. These binary features model the joint probability of the aligned subsequences of graphemes and phonemes at the current position $i$ and of up to $n-1$ consecutive pairs of aligned subsequences of graphemes and phonemes in the context window $[i-n+1, i-1]$.

The authors report that the practical application of higher-order joint n-gram features requires the exact search algorithm to be replaced with a beam search with a beam


size of 50. For the G2P conversion experiments, joint n-gram features with $n \le 6$ are used. [Jiampojamarn & Cherry+ 10] outperform [Bisani & Ney 08] on various data sets, including Celex (see Section 5.2).

1.2.4 Letter-Phoneme Alignment: An Exploration

In comparison to [Jiampojamarn & Cherry+ 10], where the online discriminative training presented in [Jiampojamarn & Cherry+ 08] and [Jiampojamarn & Kondrak 09] is improved through additional joint n-gram features, an alternative approach is to improve the performance of the online discriminative training by increasing the quality of the precomputed fixed alignment, as presented in [Jiampojamarn & Kondrak 10]. Of the various proposed alternatives to the M-to-N alignment, which is computed using an EM-based algorithm (see [Jiampojamarn & Kondrak+ 07]), we focus on the alignment-by-aggregation approach. It is based on the idea of using the M-to-N alignment algorithm presented in [Jiampojamarn & Kondrak+ 07] to create an n-best list of one-to-many alignments. Based on empirical optimization, the size of the n-best list is set to n = 5. In addition, all alignments that have a score below the threshold R = 0.8 relative to the best alignment are discarded; this threshold is also empirically optimized. In case of a disagreement between alignments in the n-best list, the one-to-many alignments are aggregated into M-to-N alignments by overlapping.

For instance, if the n-best list contains the two alignments shown in Figure 1.3, there is a disagreement concerning the first two alignment links. In the first entry, “p” is aligned to [f] and “h” to an empty phoneme, whereas “p” is aligned to an empty phoneme and “h” to [f] in the second entry.

p[f] h[-] r[r] a[e] s[z] e[-]

p[-] h[f] r[r] a[e] s[z] e[-]

Figure 1.3. Example of two disagreeing n-best list entries in the alignment-by-aggregation approach in [Jiampojamarn & Kondrak 10].

Hence, there is a disagreement between the entries, and therefore both “p” and “h” are aligned to [f] in the resulting M-to-N alignment, as presented in Figure 1.4.

ph[f] r[r] a[e] s[z] e[-]

Figure 1.4. Example of the resulting M-to-N alignment in the alignment-by-aggregation approach in [Jiampojamarn & Kondrak 10] for the two entries shown in Figure 1.3.

In comparison to the other alternative alignments introduced in [Jiampojamarn & Kondrak 10] and the M-to-N alignment computed using the algorithm presented in [Jiampojamarn & Kondrak+ 07], the alignment by aggregation performs best on all corpora used. Moreover, combining the alignment by aggregation with the online discriminative training presented in [Jiampojamarn & Cherry+ 08] and [Jiampojamarn & Kondrak 09] also outperforms the joint n-gram approach of [Bisani & Ney 08] on all corpora that have been used.
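The overlapping step can be sketched for the special case of two equal-length one-to-many alignments (a simplified, hypothetical version of the full n-best aggregation; each alignment is a list of (grapheme, phoneme) links, with "" standing for the empty phoneme [-]):

```python
def aggregate(align_a, align_b):
    """Merge two equal-length one-to-many alignments of the same
    word-pronunciation pair: links on which they agree are kept;
    maximal runs of disagreeing links are overlapped into one
    M-to-N link. Since both alignments emit the full phoneme
    sequence, the phonemes of a run can be read off either entry."""
    merged, i = [], 0
    while i < len(align_a):
        if align_a[i] == align_b[i]:
            merged.append(align_a[i])
            i += 1
            continue
        graphemes, phonemes = "", ""
        while i < len(align_a) and align_a[i] != align_b[i]:
            graphemes += align_a[i][0]
            phonemes += align_a[i][1]  # read the run's phonemes from entry a
            i += 1
        merged.append((graphemes, phonemes))
    return merged

# The two entries of Figure 1.3:
a = [("p", "f"), ("h", ""), ("r", "r"), ("a", "e"), ("s", "z"), ("e", "")]
b = [("p", ""), ("h", "f"), ("r", "r"), ("a", "e"), ("s", "z"), ("e", "")]
```

Here `aggregate(a, b)` produces the single link ("ph", "f") for the disagreeing run, matching Figure 1.4.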

1.2.5 Hidden Conditional Random Fields

A discriminative model for classification based on CRFs with latent variables, named Hidden Conditional Random Fields (HCRFs), is presented in [Quattoni & Wang+ 07]. Classification can be described as the task of predicting a label $e$ for the observations $f_1^J$. Moreover, $a_1^J$ are the latent variables not observed in the training examples, where each $a_j \in \mathcal{A}$ and $\mathcal{A}$ is a finite set of possible hidden labels in the model. The log-linear model has the form

$$p_\Lambda(e \mid f_1^J) = \sum_{a_1^J} p_\Lambda(e, a_1^J \mid f_1^J) = \frac{\sum_{a_1^J} \exp\bigl(H_\Lambda(e, a_1^J, f_1^J)\bigr)}{\sum_{e} \sum_{a_1^J} \exp\bigl(H_\Lambda(e, a_1^J, f_1^J)\bigr)}, \qquad (1.1)$$

where $\Lambda$ are the model parameters and $H_\Lambda(e, a_1^J, f_1^J)$ is a potential function parameterized by $\Lambda$. For the estimation of the model parameters, the log-likelihood of the training data, to which a Gaussian prior is added to avoid overfitting, is maximized using the gradient ascent algorithm. Due to the latent variables, the optimization problem is not convex.

The HCRF model is evaluated on two tasks: object and gesture recognition, the latter being subdivided into arm and head gesture recognition. In these experiments, the potential function is restricted to features dependent on an observation $f_j$ and a hidden state $a_j$, as well as to features dependent on the label $e$ and one or two hidden states. Experiments in object recognition show a clear improvement in performance when the model incorporates features dependent on two latent variables, i.e. when the dependency between latent variables is taken into account. Moreover, when trained and evaluated for gesture recognition, the HCRF model is shown to outperform CRFs and HMMs. In both arm and head gesture recognition, the HCRF performance is improved when the source window size is increased to 1, i.e. when the model is dependent on $f_{j-1}^{j+1}$ instead of only $f_j$.
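For small label and hidden-state sets, Equation 1.1 can be evaluated directly by enumerating all hidden sequences. The potential function below is a toy assumption for illustration, not the one used in [Quattoni & Wang+ 07]:

```python
import math
from itertools import product

def hcrf_posterior(f, labels, hidden, H):
    """p(e | f) as in Eq. 1.1: sum exp(H) over all hidden sequences
    a_1^J, normalized over all labels (brute force, exponential in J)."""
    J = len(f)
    def unnormalized(e):
        return sum(math.exp(H(e, a, f)) for a in product(hidden, repeat=J))
    scores = {e: unnormalized(e) for e in labels}
    Z = sum(scores.values())
    return {e: s / Z for e, s in scores.items()}

# Toy potential (assumption): rewards hidden states equal to the label.
def H(e, a, f):
    return sum(1.0 for a_j in a if a_j == e)
```

For instance, `hcrf_posterior("abc", labels=[0, 1], hidden=[0, 1], H=H)` returns a distribution over the two labels; by the symmetry of the toy potential, each label receives probability 0.5.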


1.2.6 Hidden Dynamic Probabilistic Models for Labeling Sequence Data

On the one hand, HCRFs output only one label $e$ for the entire observation sequence $f_1^J$, i.e. they model dependencies between hidden variables, but not between output labels. On the other hand, CRFs do not incorporate hidden variables, but model dependencies between output labels. In order to combine the advantages of both models, [Yu & Lam 08] introduce Hidden Dynamic Conditional Random Fields (HDCRFs).

For an input sequence $f_1^J$, an output sequence $e_1^J$ and a hidden variable sequence $a_1^J$, where each $a_j$ is restricted to a finite set of possible values dependent on the output $e_j$, i.e. $a_j \in \mathcal{A}_{e_j}$, and $\mathcal{A}$ is the set of all possible hidden variable values, the true probability of an output sequence for a given input sequence can be formulated as:

$$\Pr(e_1^J \mid f_1^J) = \sum_{a_1^J \in \mathcal{A}^J} \Pr(e_1^J \mid a_1^J, f_1^J) \cdot \Pr(a_1^J \mid f_1^J). \qquad (1.2)$$

The restriction of the hidden variables $a_1^J$ with $a_j \in \mathcal{A}_{e_j}$ implies that $\Pr(e_1^J \mid a_1^J, f_1^J) = 0$ if there is an $a_j$ with $a_j \notin \mathcal{A}_{e_j}$. In addition, the sets of hidden states associated with the output labels are disjoint. Hence, Equation 1.2 reduces to:

$$\Pr(e_1^J \mid f_1^J) = \sum_{a_1^J : a_j \in \mathcal{A}_{e_j}} \Pr(a_1^J \mid f_1^J). \qquad (1.3)$$

The probability $\Pr(a_1^J \mid f_1^J)$ is modelled using CRFs:

$$p_\Lambda(a_1^J \mid f_1^J) = \frac{1}{Z} \exp\left( \sum_{j=1}^{J} \sum_{k=1}^{K} \lambda_k h_k(a_{j-1}, a_j, f_1^J, j) \right), \qquad (1.4)$$

where $Z$ is the normalization constant and $\Lambda = \{\lambda_1^K\}$ are the model parameters for the corresponding set of features $\{h_1^K\}$. In order to train the model, the log-likelihood function, which contains a zero-mean Gaussian prior distribution (regularization with the L2-norm) to avoid overfitting on the training data, is maximized. The gradients are computed using belief propagation [Pearl 88]. Once the model is trained, it can be used to infer the most probable label sequence for a given observation sequence: after computing the probabilities for the disjoint sets of hidden states $\mathcal{A}_{e_j}$, the output label associated with the optimal set is returned.

In the evaluation presented in [Yu & Lam 08], HDCRFs outperform CRFs on three sequence labeling problems: POS tagging, noun phrase (NP) chunking and named entity recognition (NER).
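The normalization constant $Z$ in Equation 1.4 sums the unnormalized score over all hidden sequences. Enumerating them is exponential in $J$, but because the score factorizes over adjacent pairs $(a_{j-1}, a_j)$, $Z$ can be computed by a forward recursion in $O(J \cdot |\mathcal{A}|^2)$. A sketch under assumed names, where `psi(a_prev, a, j)` plays the role of the local factor $\exp\bigl(\sum_k \lambda_k h_k(a_{j-1}, a_j, f_1^J, j)\bigr)$:

```python
from itertools import product

def forward_Z(states, J, psi):
    """Forward recursion: alpha_j(a) = sum_{a'} alpha_{j-1}(a') * psi(a', a, j);
    psi(None, a, 1) is the initial factor and Z = sum_a alpha_J(a)."""
    alpha = {a: psi(None, a, 1) for a in states}
    for j in range(2, J + 1):
        alpha = {a: sum(alpha[ap] * psi(ap, a, j) for ap in states)
                 for a in states}
    return sum(alpha.values())

def brute_force_Z(states, J, psi):
    """Reference: sum the factorized score over all |states|^J sequences."""
    total = 0.0
    for seq in product(states, repeat=J):
        score, prev = 1.0, None
        for j, a in enumerate(seq, start=1):
            score *= psi(prev, a, j)
            prev = a
        total += score
    return total
```

Both functions agree for any local factor, e.g. `psi = lambda ap, a, j: 1.0 + 0.1 * a + (0.05 if ap == a else 0.0)` over states [0, 1, 2]; the same recursion underlies the forward-backward algorithm used later for CRF training.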


1.3 Scope of Work

The focus of this thesis is the integration of alignments into the training of Conditional Random Fields (CRFs) within the application area of Grapheme-to-Phoneme (G2P) conversion.

Compared with [Bisani & Ney 08], the similarity is the consideration of all matching alignments in training. Since we use CRFs with latent variables, our model is similar to the models introduced in [Quattoni & Wang+ 07] and [Yu & Lam 08]. The online discriminative training presented in [Jiampojamarn & Cherry+ 08] and [Jiampojamarn & Kondrak 09] introduces alignment constraints and alignment links similar to those we utilize. In addition, [Jiampojamarn & Cherry+ 10] and [Jiampojamarn & Kondrak 10] illustrate possible improvements of our CRF model.

The structure of this thesis is as follows: having concluded this chapter with an overview of the related work in the G2P conversion domain, we continue with a general introduction to CRFs for the G2P conversion task in Chapter 2. Afterwards, we investigate various alignment integration approaches and constraints, the convexity problem and the associated issue of initialization in Chapter 3. In addition, various feature functions are discussed in Chapter 4. After presenting the experiments and corresponding results in Chapter 5, we conclude with a summary and an outlook in Chapter 6.


Chapter 2

Conditional Random Fields for G2P

In this chapter, we introduce linear chain CRFs for the G2P conversion task, including the training of the model parameters using the forward-backward algorithm and the inference using the fully trained model.

2.1 Grapheme-to-Phoneme Conversion

As stated in the previous chapter, the task of G2P conversion consists of finding the most probable pronunciation for an input word. Given the grapheme sequence $f_1^J = f_1 \ldots f_j \ldots f_J$, where each $f_j$ is a symbol of an input alphabet $V_f$, one has to find the phoneme sequence $e_1^I = e_1 \ldots e_i \ldots e_I$, where each $e_i$ is a symbol of an output alphabet $V_e$, with the highest true probability $\Pr(e_1^I \mid f_1^J)$. The chain rule can be applied in order to decompose $\Pr(e_1^I \mid f_1^J)$ in the following way:

Pr(eI1 | fJ1 ) =I∏i=1

Pr(ei | ei−11 , fJ1 ) (2.1)

Because the true probability Pr(ei | ei−11 , fJ1 ) is generally unknown, it has to be ap-

proximated, which is done by computing a probability model p(ei|ei−11 , fJ1 ):

Pr(ei | ei−11 , fJ1 ) = p(ei | ei−1

1 , fJ1 ) (2.2)
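The chain-rule decomposition can be illustrated with a small sketch. The conditional model below is a made-up bigram-style table, not a trained model; it only serves to show that multiplying normalized conditionals, as in Eq. (2.1), yields a normalized distribution over whole sequences.

```python
import itertools

# Toy output alphabet; the conditional table is an assumption of this
# sketch and stands in for a learned model p(e_i | e_1^{i-1}, f_1^J).
V_e = ["a", "b"]

TABLE = {None: {"a": 0.6, "b": 0.4},
         "a":  {"a": 0.3, "b": 0.7},
         "b":  {"a": 0.5, "b": 0.5}}

def p_cond(e_i, history, f):
    # here the model ignores f and looks only at the previous phoneme
    prev = history[-1] if history else None
    return TABLE[prev][e_i]

def p_sequence(e_seq, f):
    """Pr(e_1^I | f_1^J) via the chain rule, Eq. (2.1)."""
    prob = 1.0
    for i, e_i in enumerate(e_seq):
        prob *= p_cond(e_i, e_seq[:i], f)
    return prob

# summing the chain-rule product over all sequences of a fixed length
# gives exactly 1, since each conditional is normalized
total = sum(p_sequence(seq, "xy") for seq in itertools.product(V_e, repeat=3))
```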

Although the probability of a phoneme $e_i$ is conditioned on the previous phonemes $e_1^{i-1}$ and the whole grapheme sequence $f_1^J$ in such a model, it is not conditioned on any particular graphemes. This is due to the fact that there is no way yet of expressing any relation between phonemes and the graphemes representing their pronunciation. Therefore, when trained on a certain set of words and their pronunciations, the model will not be able to learn the probability of a phoneme conditioned on a subsequence of graphemes, but only its probability conditioned on the whole word. In other words, this model will not be able to learn the pronunciation of individual subsequences of graphemes, but only the pronunciation of whole words. Consequently, when applied to infer the pronunciation of words that have not occurred during the training process, it will perform poorly.


In order to describe this more clearly, an illustration based on the following example will be given. Suppose that the two pairs of grapheme and phoneme sequences shown in Figure 2.1 appear in the training set.

Figure 2.1. Example of unaligned grapheme-phoneme pairs: "modelling" [mQdPIN] and "remodel" [rimQdP].

Because the probability of each phoneme is conditioned on the whole grapheme sequence, i.e. "modelling" or "remodel", the model will not learn the pronunciation of their common word stem "model", which is presented in Figure 2.2. Hence, the model will not be able to deduce the pronunciation of the word "model" from the examples seen during training, although the subsequence "model" occurs several times in the training data.

Figure 2.2. Example of an unaligned grapheme-phoneme pair, "model" [mQdP], which is the common word stem of the examples presented in Figure 2.1. Without an alignment, the model will not be able to deduce the pronunciation of this word from the examples in Figure 2.1.

In order to improve the model, dependencies between graphemes and phonemes have to be incorporated, meaning there has to be a way of linking phonemes and graphemes that are related by pronunciation. Therefore, an alignment $a_1^J = a_1 \ldots a_j \ldots a_J$ is introduced as a hidden variable, where $a_j = i \in \{1, \ldots, I\}$ if the grapheme $f_j$ generates the phoneme $e_i$. Finally, we obtain:

$$Pr(e_1^I \mid f_1^J) = \sum_{a_1^J} Pr(e_1^I, a_1^J \mid f_1^J) \quad (2.3)$$

It is worth noting that the above definition implies that each grapheme can be aligned to at most one phoneme. This issue will be addressed in Section 3.3. Considering the true probability $\sum_{a_1^J} Pr(e_1^I, a_1^J \mid f_1^J)$, there are two different modelling approaches that will be investigated below.


2.1.1 Maximum Approach

On the one hand, the true probability $Pr(e_1^I \mid f_1^J)$ can be approximated by using just the single best alignment $\hat{a}_1^J$, which is computed in a preprocessing step. Hence, the CRF model $p_\Lambda(e_1^I, \hat{a}_1^J \mid f_1^J)$ has to be computed.

$$Pr(e_1^I \mid f_1^J) = \sum_{a_1^J} Pr(e_1^I, a_1^J \mid f_1^J) \quad (2.4)$$
$$\approx \max_{a_1^J} \left\{ Pr(e_1^I, a_1^J \mid f_1^J) \right\} \quad (2.5)$$
$$= Pr(e_1^I, \hat{a}_1^J \mid f_1^J) \quad \text{with precomputed alignment } \hat{a}_1^J = \operatorname*{argmax}_{a_1^J} \left\{ Pr(e_1^I, a_1^J \mid f_1^J) \right\} \quad (2.6)$$

2.1.2 Summation Approach

On the other hand, the sum over all possible alignments $a_1^J$ does not have to be approximated, but can be included in the model. In order to reduce the computational costs, which are exponential in the case of an arbitrary alignment, certain constraints can be imposed in the application domain of G2P. We will look at these different approaches in more depth in Chapter 3 and focus in this chapter on the following CRF model $\sum_{a_1^J} p_\Lambda(e_1^I, a_1^J \mid f_1^J)$:

$$Pr(e_1^I \mid f_1^J) = \sum_{a_1^J} Pr(e_1^I, a_1^J \mid f_1^J) \quad (2.7)$$

2.2 Linear Chain CRF

As presented by [Lafferty & McCallum+ 01], a CRF is an undirected graphical model of a target sequence of random variables $e_1^I$, conditioned on the observations $f_1^J$, that has a log-linear form. Its associated graph $G = (V, E)$ contains one vertex $v_i \in V$ for each random variable $e_i$. The set of edges $E \subset V \times V$ is defined in such a way that each random variable $e_i$ depends on $e_{i'}$ with $i \neq i'$ if and only if $v_i$ and $v_{i'}$ are neighbours in $G$, i.e. $(v_i, v_{i'}) \in E$. The random variables $e_1^I$ obey the Markov property with respect to the graph:

$$p(e_i \mid f_1^J, e_1^{i-1}, e_{i+1}^I) = p(e_i \mid f_1^J, e_{i'} \text{ with } (v_i, v_{i'}) \in E) \quad (2.9)$$

Since the probability distribution is globally conditioned on $f_1^J$, the label bias problem is solved. In order to allow for an efficient computation using dynamic programming,


we make the first-order Markov assumption, i.e. we assume that in $G$ there are only edges between $v_{i-1}$ and $v_i$ for all $i \in \{1, \ldots, I\}$, i.e. that each $e_i$ is conditionally independent of all vertices except for $e_{i-1}$ and $e_{i+1}$. This results in a linear chain CRF with $G = (V, E)$, $V = \{v_1, \ldots, v_I\}$ and $E = \{(v_i, v_{i+1}) \mid i \in \{1, \ldots, I-1\}\}$, which we are going to use throughout this thesis. Furthermore, we add special symbols $e_{start}$ and $e_{end}$ to mark the beginning and the end of a phoneme sequence and set $e_0 = e_{start}$, $e_{I+1} = e_{end}$. Thus we obtain the following model $p_\Lambda(e_1^I \mid f_1^J)$ corresponding to a linear chain CRF:

$$p_\Lambda(e_1^I \mid f_1^J) = \frac{1}{Z} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, f_1^J) \right) \quad (2.10)$$

$$Z = \sum_{e_1^I} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, f_1^J) \right) \quad (2.11)$$

Here, $Z$ is the normalization constant, and $\Lambda = \{\lambda_1, \ldots, \lambda_K\}$ is the set of feature weights, which represent the model parameters and belong to the set of feature functions $\{h_1, \ldots, h_K\}$. In addition, we have to include a dependency of the feature functions on the alignment $a_1^J$, as our goal is to model $Pr(e_1^I, a_1^J \mid f_1^J)$. Because the CRF model has to be globally normalized, we have to include the sum over all possible alignments $\sum_{a_1^J}$ in our normalization term $Z$.

$$p_\Lambda(e_1^I, a_1^J \mid f_1^J) = \frac{1}{Z} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \quad (2.12)$$

$$Z = \sum_{e_1^I} \sum_{a_1^J} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \quad (2.13)$$

Finally, we obtain the following models for the approaches presented above:

1. CRF Model for the Maximum Approach:

$$Pr(e_1^I \mid f_1^J) \approx Pr(e_1^I, \hat{a}_1^J \mid f_1^J) \quad \text{with precomputed alignment } \hat{a}_1^J = \operatorname*{argmax}_{a_1^J} \left\{ Pr(e_1^I, a_1^J \mid f_1^J) \right\} \quad (2.14)$$
$$= p_\Lambda(e_1^I, \hat{a}_1^J \mid f_1^J) \quad (2.15)$$
$$= \frac{1}{Z} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, \hat{a}_1^J, f_1^J) \right) \quad (2.16)$$


2. CRF Model for the Summation Approach:

$$Pr(e_1^I \mid f_1^J) = \sum_{a_1^J} Pr(e_1^I, a_1^J \mid f_1^J) \quad (2.17)$$
$$= \frac{1}{Z} \sum_{a_1^J} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \quad (2.18)$$

In both cases, the normalization constant $Z$ is the same because the CRF has to be globally normalized over all phoneme sequences $e_1^I$ and alignments $a_1^J$:

$$Z = \sum_{e_1^I} \sum_{a_1^J} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right). \quad (2.19)$$

2.3 Training

2.3.1 Training Criterion, Algorithm and Regularization

The parameters $\Lambda$ of our CRF model $\sum_{a_1^J} p_\Lambda(e_1^I, a_1^J \mid f_1^J)$ are estimated on a training corpus $S$ of pairs of pronunciation sequences and words:

$$S = \left\{ (e_1^I, f_1^J)_r \mid r = 1, \ldots, R \right\}. \quad (2.20)$$

As proposed by [Lafferty & McCallum+ 01], the training criterion consists of the maximization of the conditional log-likelihood function, which will also be referred to as the objective function. Because of its monotonicity, the conditional log-likelihood function yields the same results as the conditional likelihood function.

$$L(\Lambda) = \log\left( \prod_{r=1}^{R} \sum_{a_1^J} p_\Lambda(e_1^I, a_1^J \mid f_1^J)_r \right) \quad (2.21)$$
$$= \sum_{r=1}^{R} \log\left( \sum_{a_1^J} p_\Lambda(e_1^I, a_1^J \mid f_1^J)_r \right) \quad (2.22)$$

For reasons of clarity and readability, we will drop the index $r$ in the rest of this chapter, e.g. we will write $p_\Lambda(e_1^I, a_1^J \mid f_1^J)$ instead of $p_\Lambda(e_1^I, a_1^J \mid f_1^J)_r$. Even though $L(\Lambda)$ is a smooth function, there is no closed form solution. Therefore, we have to use an approximation algorithm and, due to its fast convergence, choose the gradient ascent algorithm RPROP; further details can be found in [Riedmiller & Braun 93].
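The RPROP update only uses the sign of each partial derivative and maintains a per-parameter step size. The following condensed sketch follows [Riedmiller & Braun 93]; the step-size constants, the variant without weight-backtracking, and the one-dimensional toy objective are assumptions of this sketch.

```python
# Sign-based RPROP gradient ascent, after [Riedmiller & Braun 93].
ETA_PLUS, ETA_MINUS = 1.2, 0.5
STEP_MIN, STEP_MAX = 1e-6, 50.0

def rprop_step(lam, grad, prev_grad, step):
    """One ascent update of all weights lam[k], in place."""
    for k in range(len(lam)):
        if prev_grad[k] * grad[k] > 0:       # gradient kept its sign: speed up
            step[k] = min(step[k] * ETA_PLUS, STEP_MAX)
        elif prev_grad[k] * grad[k] < 0:     # sign flip: we overshot, slow down
            step[k] = max(step[k] * ETA_MINUS, STEP_MIN)
        if grad[k] > 0:                      # move by the step size only,
            lam[k] += step[k]                # ignoring the gradient magnitude
        elif grad[k] < 0:
            lam[k] -= step[k]
        prev_grad[k] = grad[k]

# toy objective L(l) = -(l - 3)^2 with gradient -2 (l - 3)
lam, prev, step = [0.0], [0.0], [0.1]
for _ in range(200):
    rprop_step(lam, [-2.0 * (lam[0] - 3.0)], prev, step)
```

The key design point is that only the sign of the gradient enters the update, which makes the method robust against the widely varying magnitudes of the partial derivatives of the CRF objective.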


Since CRFs tend toward overfitting in general, we impose a regularization term $c\,\lVert \lambda_1^K \rVert^2$ and obtain the following objective function $L(\Lambda, c)$:

$$L(\Lambda, c) = \sum_{r=1}^{R} \log\left( \sum_{a_1^J} p_\Lambda(e_1^I, a_1^J \mid f_1^J) \right) - c\,\lVert \lambda_1^K \rVert^2 \quad (2.23)$$

$$= \sum_{r=1}^{R} \left( \log\left( \sum_{a_1^J} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \right) - \log\left( \sum_{e_1^I} \sum_{a_1^J} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \right) \right) - c\,\lVert \lambda_1^K \rVert^2 \quad (2.24)$$

For the sake of simplicity, we introduce the function $H(e_1^I, a_1^J, f_1^J)$, which combines the position-dependent features:

$$H(e_1^I, a_1^J, f_1^J) = \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J). \quad (2.25)$$

After substituting $H(e_1^I, a_1^J, f_1^J)$ into the objective function we obtain:

$$L(\Lambda, c) = \sum_{r=1}^{R} \left( \log\left( \sum_{a_1^J} \exp H(e_1^I, a_1^J, f_1^J) \right) - \log\left( \sum_{e_1^I} \sum_{a_1^J} \exp H(e_1^I, a_1^J, f_1^J) \right) \right) - c\,\lVert \lambda_1^K \rVert^2. \quad (2.26)$$

2.3.2 Gradient

Obviously, the gradient of $L(\Lambda, c)$ can be easily computed using the gradient of $L(\Lambda)$, since the following holds:

$$\frac{\partial L(\Lambda, c)}{\partial \lambda_l} = \frac{\partial L(\Lambda)}{\partial \lambda_l} - 2c\lambda_l. \quad (2.27)$$
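The regularizer's contribution to the gradient can be confirmed with a quick finite-difference check: the partial derivative of $-c\,\lVert \lambda_1^K \rVert^2$ with respect to $\lambda_l$ is $-2c\lambda_l$. The weight values below are arbitrary toy numbers.

```python
# Numeric check of the regularizer's gradient contribution:
# d/d(lambda_l) of -c * ||lambda_1^K||^2 equals -2 * c * lambda_l.
c = 0.1
lam = [0.5, -1.2, 2.0]

def reg(lam):
    return -c * sum(x * x for x in lam)

l, eps = 1, 1e-6
bumped = lam.copy()
bumped[l] += eps
fd = (reg(bumped) - reg(lam)) / eps      # one-sided finite difference
analytic = -2.0 * c * lam[l]
```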


Below, we present the computation of the gradient of $L(\Lambda)$:

$$\frac{\partial L(\Lambda)}{\partial \lambda_l} = \sum_{r=1}^{R} \left( \frac{\partial}{\partial \lambda_l} \log\left( \sum_{a_1^J} \exp H(e_1^I, a_1^J, f_1^J) \right) - \frac{\partial}{\partial \lambda_l} \log\left( \sum_{e_1^I} \sum_{a_1^J} \exp H(e_1^I, a_1^J, f_1^J) \right) \right) \quad (2.28)$$

$$= \sum_{r=1}^{R} \left( \sum_{a_1^J} \frac{\exp H(e_1^I, a_1^J, f_1^J)}{\sum_{{a'}_1^J} \exp H(e_1^I, {a'}_1^J, f_1^J)} \cdot \frac{\partial H(e_1^I, a_1^J, f_1^J)}{\partial \lambda_l} - \sum_{e_1^I} \sum_{a_1^J} \frac{\exp H(e_1^I, a_1^J, f_1^J)}{\sum_{{e'}_1^I} \sum_{{a'}_1^J} \exp H({e'}_1^I, {a'}_1^J, f_1^J)} \cdot \frac{\partial H(e_1^I, a_1^J, f_1^J)}{\partial \lambda_l} \right) \quad (2.29)$$

$$= \sum_{r=1}^{R} \left( \sum_{a_1^J} \frac{\exp H(e_1^I, a_1^J, f_1^J)}{\sum_{{a'}_1^J} \exp H(e_1^I, {a'}_1^J, f_1^J)} \cdot \sum_{i=1}^{I} h_l(e_{i-1}, e_i, a_1^J, f_1^J) - \sum_{e_1^I} \sum_{a_1^J} \frac{\exp H(e_1^I, a_1^J, f_1^J)}{\sum_{{e'}_1^I} \sum_{{a'}_1^J} \exp H({e'}_1^I, {a'}_1^J, f_1^J)} \cdot \sum_{i=1}^{I} h_l(e_{i-1}, e_i, a_1^J, f_1^J) \right) \quad (2.30)$$

$$= \sum_{r=1}^{R} \left( \sum_{a_1^J} p_\Lambda(a_1^J \mid f_1^J, e_1^I) \cdot \sum_{i=1}^{I} h_l(e_{i-1}, e_i, a_1^J, f_1^J) - \sum_{e_1^I} \sum_{a_1^J} p_\Lambda(e_1^I, a_1^J \mid f_1^J) \cdot \sum_{i=1}^{I} h_l(e_{i-1}, e_i, a_1^J, f_1^J) \right). \quad (2.31)$$

As a result we obtain

$$\frac{\partial L(\Lambda)}{\partial \lambda_l} = \sum_{r=1}^{R} \left( N_l - D_l \right) \quad (2.32)$$

with the numerator $N_l$ and the denominator $D_l$:

$$N_l = \sum_{a_1^J} p_\Lambda(a_1^J \mid f_1^J, e_1^I) \cdot \sum_{i=1}^{I} h_l(e_{i-1}, e_i, a_1^J, f_1^J) \quad (2.33)$$
$$D_l = \sum_{a_1^J} \sum_{e_1^I} p_\Lambda(e_1^I, a_1^J \mid f_1^J) \cdot \sum_{i=1}^{I} h_l(e_{i-1}, e_i, a_1^J, f_1^J). \quad (2.34)$$
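The structure of the gradient, a difference of two expected feature counts, can be verified numerically on a toy instance. The sketch below brute-forces $N_l$ and $D_l$ of Eqs. (2.33) and (2.34) for one training pair and compares $N_l - D_l$ against a finite difference of the log-likelihood; alphabets, features and weights are toy assumptions.

```python
import itertools
import math

# Brute-force check that the gradient of the log-likelihood equals
# N_l - D_l (Eqs. 2.32-2.34) on a toy model.
V_e = ["A", "B"]
I = J = 2
f_obs = ("x", "y")
e_obs = ("A", "B")

def feats(e, a):
    """Multiset of active features, summed over positions i
    (hypothetical feature set)."""
    out = []
    e_ext = ("<s>",) + e
    for i in range(1, I + 1):
        out.append(("bi", e_ext[i - 1], e_ext[i]))
        for j in range(J):
            if a[j] == i:
                out.append(("lex", f_obs[j], e_ext[i]))
    return out

def H(e, a, lam):
    return sum(lam.get(ft, 0.0) for ft in feats(e, a))

all_e = list(itertools.product(V_e, repeat=I))
all_a = list(itertools.product(range(1, I + 1), repeat=J))

def loglik(lam):
    num = sum(math.exp(H(e_obs, a, lam)) for a in all_a)
    Z = sum(math.exp(H(e, a, lam)) for e in all_e for a in all_a)
    return math.log(num) - math.log(Z)

def grad(lam, l):
    num = sum(math.exp(H(e_obs, a, lam)) for a in all_a)
    Z = sum(math.exp(H(e, a, lam)) for e in all_e for a in all_a)
    N_l = sum(math.exp(H(e_obs, a, lam)) / num * feats(e_obs, a).count(l)
              for a in all_a)                        # Eq. (2.33)
    D_l = sum(math.exp(H(e, a, lam)) / Z * feats(e, a).count(l)
              for e in all_e for a in all_a)         # Eq. (2.34)
    return N_l - D_l

lam = {("lex", "x", "A"): 0.7, ("bi", "A", "B"): -0.3}
l, eps = ("lex", "x", "A"), 1e-6
bumped = dict(lam)
bumped[l] = lam[l] + eps
fd = (loglik(bumped) - loglik(lam)) / eps
```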


2.3.3 Forward-Backward Algorithm

Unfortunately, the formulation of the denominator $D_l$ shown above carries the disadvantage of inefficiency, which is due to the term $\sum_{e_1^I} p_\Lambda(e_1^I, a_1^J \mid f_1^J)$. For instance, if the output vocabulary $V_e$ contains $|V_e| \approx 50$ phonemes and the phoneme sequence is of length $I = 7$, then one has to compute $p_\Lambda(e_1^I, a_1^J \mid f_1^J)$ a total of $|V_e|^I \approx 7.8 \cdot 10^{11}$ times.

Obviously, an efficient computation of $D_l$ requires a more efficient decomposition of the term $\sum_{e_1^I} p_\Lambda(e_1^I, a_1^J \mid f_1^J)$. In the following, we will present a solution to this problem using the forward-backward algorithm, which makes use of dynamic programming to efficiently compute the required values.

In addition, we introduce the potential function $q(e_{i-1}, e_i, a_1^J, f_1^J)$ for the sake of simplicity. In principle, it expresses the unnormalized probability of a transition from vertex $e_{i-1}$ to vertex $e_i$ in the CRF, given the alignment $a_1^J$ and the grapheme sequence $f_1^J$:

$$q(e_{i-1}, e_i, a_1^J, f_1^J) = \exp\left( \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \quad (2.35)$$

By substituting the potential function $q(e_{i-1}, e_i, a_1^J, f_1^J)$, we can now rewrite the term $\sum_{e_1^I} p_\Lambda(e_1^I, a_1^J \mid f_1^J)$:

$$\sum_{e_1^I} p_\Lambda(e_1^I, a_1^J \mid f_1^J) = \frac{\sum_{e_1^I} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right)}{\sum_{{a'}_1^J} \sum_{{e'}_1^I} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e'_{i-1}, e'_i, {a'}_1^J, f_1^J) \right)} \quad (2.36)$$

$$= \frac{\sum_{e_1^I} \prod_{i=1}^{I} q(e_{i-1}, e_i, a_1^J, f_1^J)}{\underbrace{\sum_{{a'}_1^J} \sum_{{e'}_1^I} \prod_{i=1}^{I} q(e'_{i-1}, e'_i, {a'}_1^J, f_1^J)}_{Z}} \quad (2.37)$$

Firstly, we rewrite the denominator $Z$. Hence, we introduce the forward vectors $\alpha_i(e_i, a_1^J, f_1^J)$, which can be interpreted as the sum of the unnormalized probabilities of all phoneme sequences $e_1^i$ with arbitrary $e_1^{i-1}$ but fixed $e_i$:

$$\alpha_i(e_i, a_1^J, f_1^J) = \sum_{e_{i-1}} q(e_{i-1}, e_i, a_1^J, f_1^J) \cdot \alpha_{i-1}(e_{i-1}, a_1^J, f_1^J) \quad (2.38)$$
$$\alpha_1(e_1, a_1^J, f_1^J) = q(e_0, e_1, a_1^J, f_1^J) \quad (2.39)$$


We proceed with the actual rewriting of $Z$ using the newly introduced forward vectors $\alpha_i(e_i, a_1^J, f_1^J)$:

$$Z = \sum_{{a'}_1^J} \sum_{{e'}_1^I} \prod_{i=1}^{I} q(e'_{i-1}, e'_i, {a'}_1^J, f_1^J) \quad (2.40)$$

$$= \sum_{{a'}_1^J} \sum_{e'_I} \sum_{e'_{I-1}} \sum_{e'_{I-2}} \ldots \sum_{e'_1} \prod_{i=1}^{I} q(e'_{i-1}, e'_i, {a'}_1^J, f_1^J) \quad (2.41)$$

$$= \sum_{{a'}_1^J} \sum_{e'_I} \sum_{e'_{I-1}} q(e'_{I-1}, e'_I, {a'}_1^J, f_1^J) \cdot \sum_{e'_{I-2}} q(e'_{I-2}, e'_{I-1}, {a'}_1^J, f_1^J) \cdot \ldots \cdot \sum_{e'_1} q(e'_1, e'_2, {a'}_1^J, f_1^J) \cdot \underbrace{q(e'_0, e'_1, {a'}_1^J, f_1^J)}_{\alpha_1(e'_1, {a'}_1^J, f_1^J)} \quad (2.42)$$

$$= \sum_{{a'}_1^J} \sum_{e'_I} \sum_{e'_{I-1}} q(e'_{I-1}, e'_I, {a'}_1^J, f_1^J) \cdot \sum_{e'_{I-2}} q(e'_{I-2}, e'_{I-1}, {a'}_1^J, f_1^J) \cdot \ldots \cdot \underbrace{\sum_{e'_1} q(e'_1, e'_2, {a'}_1^J, f_1^J) \cdot \alpha_1(e'_1, {a'}_1^J, f_1^J)}_{\alpha_2(e'_2, {a'}_1^J, f_1^J)} \quad (2.43)$$

$$= \ldots \quad (2.44)$$

$$= \sum_{{a'}_1^J} \sum_{e'_I} \underbrace{\sum_{e'_{I-1}} q(e'_{I-1}, e'_I, {a'}_1^J, f_1^J) \cdot \alpha_{I-1}(e'_{I-1}, {a'}_1^J, f_1^J)}_{\alpha_I(e'_I, {a'}_1^J, f_1^J)} \quad (2.45)$$

$$= \sum_{{a'}_1^J} \sum_{e'_I} \alpha_I(e'_I, {a'}_1^J, f_1^J) \quad (2.46)$$

Secondly, we rewrite $\sum_{e_1^I} \prod_{i=1}^{I} q(e_{i-1}, e_i, a_1^J, f_1^J)$, the numerator of the fraction in Equation 2.37, by using the forward vectors $\alpha_i(e_i, a_1^J, f_1^J)$ and also backward vectors $\beta_{i+1}(e_i, a_1^J, f_1^J)$, which are defined below. They can be interpreted as the sum of the unnormalized probabilities of all phoneme sequences $e_i^I$ with arbitrary $e_{i+1}^I$ but fixed $e_i$:

$$\beta_{i+1}(e_i, a_1^J, f_1^J) = \sum_{e_{i+1}} q(e_i, e_{i+1}, a_1^J, f_1^J) \cdot \beta_{i+2}(e_{i+1}, a_1^J, f_1^J) \quad (2.47)$$
$$\beta_{I+1}(e_I, a_1^J, f_1^J) = 1 \quad (2.48)$$


After substituting the forward vectors recursively, we obtain

$$\alpha_{i-1}(e_{i-1}, a_1^J, f_1^J) = \sum_{e_{i-2}} q(e_{i-2}, e_{i-1}, a_1^J, f_1^J) \cdot \alpha_{i-2}(e_{i-2}, a_1^J, f_1^J) \quad (2.49)$$
$$= \ldots \quad (2.50)$$
$$= \sum_{e_{i-2}} q(e_{i-2}, e_{i-1}, a_1^J, f_1^J) \cdot \ldots \cdot \sum_{e_1} q(e_1, e_2, a_1^J, f_1^J) \cdot q(e_0, e_1, a_1^J, f_1^J) \quad (2.51)$$
$$= \sum_{e_1^{i-2}} \prod_{i'=1}^{i-1} q(e_{i'-1}, e_{i'}, a_1^J, f_1^J) \quad (2.52)$$

and respectively for the backward vectors

$$\beta_{i+1}(e_i, a_1^J, f_1^J) = \sum_{e_{i+1}} q(e_i, e_{i+1}, a_1^J, f_1^J) \cdot \beta_{i+2}(e_{i+1}, a_1^J, f_1^J) \quad (2.53)$$
$$= \ldots \quad (2.54)$$
$$= \sum_{e_{i+1}} q(e_i, e_{i+1}, a_1^J, f_1^J) \cdot \ldots \cdot \sum_{e_I} q(e_{I-1}, e_I, a_1^J, f_1^J) \quad (2.55)$$
$$= \sum_{e_{i+1}^I} \prod_{i'=i+1}^{I} q(e_{i'-1}, e_{i'}, a_1^J, f_1^J). \quad (2.56)$$

By substituting both forward and backward vectors in the above notation, we can rewrite the numerator of the fraction in Equation 2.37 as follows:

$$\sum_{e_1^I} \prod_{i=1}^{I} q(e_{i-1}, e_i, a_1^J, f_1^J) = \sum_{e_1} \ldots \sum_{e_I} \prod_{i=1}^{I} q(e_{i-1}, e_i, a_1^J, f_1^J) \quad (2.57)$$

$$= \sum_{e_{i-1}, e_i} \left( \sum_{e_1^{i-2}} \prod_{i'=1}^{i-1} q(e_{i'-1}, e_{i'}, a_1^J, f_1^J) \right) \cdot q(e_{i-1}, e_i, a_1^J, f_1^J) \cdot \left( \sum_{e_{i+1}^I} \prod_{i'=i+1}^{I} q(e_{i'-1}, e_{i'}, a_1^J, f_1^J) \right) \quad (2.58)$$

$$= \sum_{e_{i-1}, e_i} \alpha_{i-1}(e_{i-1}, a_1^J, f_1^J) \cdot q(e_{i-1}, e_i, a_1^J, f_1^J) \cdot \beta_{i+1}(e_i, a_1^J, f_1^J). \quad (2.59)$$


In summary, our starting point was the computation of the gradient

$$\frac{\partial L(\Lambda)}{\partial \lambda_l} = \sum_{r=1}^{R} \left( N_l - D_l \right), \quad (2.60)$$

with the numerator $N_l$ and the denominator $D_l$:

$$N_l = \sum_{a_1^J} p_\Lambda(a_1^J \mid f_1^J, e_1^I) \cdot \sum_{i=1}^{I} h_l(e_{i-1}, e_i, a_1^J, f_1^J) \quad (2.61)$$
$$D_l = \sum_{a_1^J} \sum_{e_1^I} p_\Lambda(e_1^I, a_1^J \mid f_1^J) \cdot \sum_{i=1}^{I} h_l(e_{i-1}, e_i, a_1^J, f_1^J). \quad (2.62)$$

In order to allow for an efficient computation of the denominator $D_l$, we rewrote the term $\sum_{e_1^I} p_\Lambda(e_1^I, a_1^J \mid f_1^J)$ in the following way:

$$\sum_{e_1^I} p_\Lambda(e_1^I, a_1^J \mid f_1^J) = \frac{1}{Z} \sum_{e_{i-1}, e_i} \alpha_{i-1}(e_{i-1}, a_1^J, f_1^J) \cdot q(e_{i-1}, e_i, a_1^J, f_1^J) \cdot \beta_{i+1}(e_i, a_1^J, f_1^J) \quad (2.63)$$
$$Z = \sum_{{a'}_1^J} \sum_{e'_I} \alpha_I(e'_I, {a'}_1^J, f_1^J). \quad (2.64)$$

For this purpose, we introduced the potential function $q$, the forward vectors $\alpha_i$ and the backward vectors $\beta_i$ as defined below:

$$q(e_{i-1}, e_i, a_1^J, f_1^J) = \exp\left( \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \quad (2.65)$$
$$\alpha_i(e_i, a_1^J, f_1^J) = \sum_{e_{i-1}} q(e_{i-1}, e_i, a_1^J, f_1^J) \cdot \alpha_{i-1}(e_{i-1}, a_1^J, f_1^J) \quad (2.66)$$
$$\alpha_1(e_1, a_1^J, f_1^J) = q(e_0, e_1, a_1^J, f_1^J) \quad (2.67)$$
$$\beta_{i+1}(e_i, a_1^J, f_1^J) = \sum_{e_{i+1}} q(e_i, e_{i+1}, a_1^J, f_1^J) \cdot \beta_{i+2}(e_{i+1}, a_1^J, f_1^J) \quad (2.68)$$
$$\beta_{I+1}(e_I, a_1^J, f_1^J) = 1. \quad (2.69)$$

The numerator of the gradient, $N_l$, can be efficiently computed in a similar way. Since the denominator can be represented as a finite state automaton (FSA), the numerator can be represented as an FSA containing exactly those states and transitions of the denominator FSA that correspond to the phoneme sequence $e_1^I$. In the following section, we will further investigate the complexity and the resulting gain when using the forward-backward algorithm.
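The forward recursion of Eqs. (2.66) and (2.67) can be checked against brute-force enumeration on a toy instance. The potential below is an arbitrary positive toy function standing in for Eq. (2.65) with one fixed alignment and grapheme sequence; these choices are assumptions of this sketch.

```python
import itertools

# Forward recursion for one fixed alignment, checked against
# brute-force enumeration over all phoneme sequences.
V_e = ["A", "B", "C"]
I = 4

def q(e_prev, e_i, i):
    # deterministic, made-up positive potentials
    return 1.0 + 0.01 * (ord(e_prev[0]) % 5) + 0.02 * (ord(e_i[0]) % 3) + 0.03 * i

# forward pass: alpha[e] holds alpha_i(e_i, ...) for the current i
alpha = {e: q("<s>", e, 1) for e in V_e}                    # Eq. (2.67)
for i in range(2, I + 1):                                   # Eq. (2.66)
    alpha = {e: sum(q(ep, e, i) * alpha[ep] for ep in V_e) for e in V_e}
Z_forward = sum(alpha.values())       # contribution of one alignment to Z

# brute force: sum the product of potentials over all |V_e|^I sequences
Z_brute = 0.0
for seq in itertools.product(V_e, repeat=I):
    prod, prev = 1.0, "<s>"
    for i, e in enumerate(seq, start=1):
        prod *= q(prev, e, i)
        prev = e
    Z_brute += prod
```

The backward vectors of Eq. (2.68) are computed by the symmetric recursion running from $i = I$ down to $1$; combining both then yields the per-position marginals of Eq. (2.63).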


2.3.4 Runtime Complexity

We will ignore the complexity of the sum over all possible alignments for the time being. Thus, without the forward-backward algorithm we would have to compute:

$$\sum_{e_1^I} p_\Lambda(e_1^I, a_1^J \mid f_1^J) = \sum_{e_1^I} \frac{1}{Z} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \quad (2.70)$$
$$Z = \sum_{e_1^I} \sum_{a_1^J} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right). \quad (2.71)$$

Due to the term $\sum_{e_1^I}$, we would have to compute $p_\Lambda(e_1^I, a_1^J \mid f_1^J)$ a total of $|V_e|^I$ times, where each computation would result in $I \cdot K$ summations. Finally, this would result in a complexity of

$$O(|V_e|^I \cdot I \cdot K). \quad (2.72)$$

Since we are using the forward-backward algorithm, we have to compute $\alpha_{i-1}(e_{i-1}, a_1^J, f_1^J)$ as well as $\beta_{i+1}(e_i, a_1^J, f_1^J)$ for every $e_{i-1}$ or $e_i$ respectively, i.e. both $|V_e|$ times. In each of their $O(I)$ recursions, the current value has to be computed for $|V_e|$ possible predecessors or successors respectively. Each of these computations requires $K$ summations over the feature functions. Altogether, we have a complexity of

$$O(|V_e|^2 \cdot I \cdot K). \quad (2.73)$$

Obviously, this is a significant reduction in complexity, as training corpora in languages like English tend to have about $|V_e| = 50$ different phonemes and an average phoneme sequence length of $I = 7$.
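The size of this reduction for the quoted English figures can be made concrete with a short calculation; the common factor $K$ cancels in the ratio and is therefore left out.

```python
# Order-of-magnitude comparison of Eqs. (2.72) and (2.73) for the
# figures quoted in the text (|V_e| = 50, I = 7).
V, I = 50, 7
naive = V ** I * I         # ~ |V_e|^I * I  (times K) without forward-backward
fwd_bwd = V ** 2 * I       # ~ |V_e|^2 * I  (times K) with forward-backward
speedup = naive / fwd_bwd  # = |V_e|^(I - 2)
```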

2.4 Search

As soon as we have trained our model, i.e. estimated the best model parameters $\Lambda$ on our training corpus, we can apply it to infer the most probable phoneme sequence $e_1^I$ for a given grapheme sequence $f_1^J$, which need not have been seen during training. However, there are two aspects to be pointed out. Firstly, we approximate the summation over all possible alignments by the alignment that allows for the most probable pair of a phoneme sequence and an alignment. Secondly, we can drop the denominator $Z$, since the only goal is determining the most probable phoneme sequence, i.e. we can drop the normalization of all probabilities because this does not affect the inference. According to Bayes' decision rule we obtain:


$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \left\{ Pr(e_1^I \mid f_1^J) \right\} \quad (2.74)$$
$$= \operatorname*{argmax}_{e_1^I} \left\{ \sum_{a_1^J} p_\Lambda(e_1^I, a_1^J \mid f_1^J) \right\} \quad (2.75)$$
$$\approx \operatorname*{argmax}_{e_1^I, a_1^J} \left\{ p_\Lambda(e_1^I, a_1^J \mid f_1^J) \right\} \quad (2.76)$$
$$= \operatorname*{argmax}_{e_1^I, a_1^J} \left\{ \frac{1}{Z} \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \right\} \quad (2.77)$$
$$= \operatorname*{argmax}_{e_1^I, a_1^J} \left\{ \exp\left( \sum_{i=1}^{I} \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \right\}. \quad (2.78)$$

In order to infer efficiently, we again make use of dynamic programming and apply the Viterbi algorithm. In principle, we compute the unnormalized score of each $e_i$ by maximizing over the product of the scores of its predecessors and their transitions, and memorize the predecessor that allows for the maximal product. After computing $V_{I+1}(e_{I+1})$, we apply backtracking and determine the most probable phoneme sequence $\hat{e}_1^I$. The Viterbi functions are defined as follows:

$$V_i(e_i) = \max_{e_{i-1}} \left\{ V_{i-1}(e_{i-1}) \cdot \exp\left( \sum_{k=1}^{K} \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J) \right) \right\} \quad (2.79)$$
$$V_1(e_1) = \exp\left( \sum_{k=1}^{K} \lambda_k h_k(e_0, e_1, a_1^J, f_1^J) \right). \quad (2.80)$$
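The Viterbi recursion with backtracking can be sketched as follows. The transition score stands in for $\exp(\sum_k \lambda_k h_k(e_{i-1}, e_i, a_1^J, f_1^J))$ and is an arbitrary positive toy function; the result is checked against brute-force enumeration of all sequences.

```python
import itertools

# Viterbi search of Eqs. (2.79)-(2.80), checked against brute force.
V_e = ["A", "B", "C"]
I = 5

def trans(e_prev, e_i, i):
    # made-up unnormalized transition score
    return 1.0 + 0.05 * (ord(e_prev[0]) % 4) + 0.07 * (ord(e_i[0]) % 5) + 0.01 * i

# forward pass: V[e] = V_i(e), plus backpointers for the backtracking
V = {e: trans("<s>", e, 1) for e in V_e}             # Eq. (2.80)
back = []
for i in range(2, I + 1):                            # Eq. (2.79)
    newV, bp = {}, {}
    for e in V_e:
        best = max(V_e, key=lambda ep: V[ep] * trans(ep, e, i))
        newV[e], bp[e] = V[best] * trans(best, e, i), best
    V, back = newV, back + [bp]

# backtracking from the best final phoneme
e_i = max(V_e, key=lambda e: V[e])
seq = [e_i]
for bp in reversed(back):
    e_i = bp[e_i]
    seq.append(e_i)
seq.reverse()

# brute-force check over all |V_e|^I sequences
def score(s):
    prod, prev = 1.0, "<s>"
    for i, e in enumerate(s, start=1):
        prod *= trans(prev, e, i)
        prev = e
    return prod

best_brute = max(itertools.product(V_e, repeat=I), key=score)
```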


Chapter 3

Alignments

After having laid the mathematical foundations in terms of the CRF model, includingthe training of its feature weights and its application of inferring the most probablepronunciation of a given grapheme sequence, we will now investigate the topic ofalignments in this chapter.

3.1 Overview

As we have already briefly described in Section 2.1, there is a need to introduce an alignment into our model in order to establish a relation between graphemes and the particular phonemes generated by them. Allowing the model to learn probabilities conditioned on links between graphemes and phonemes related by pronunciation helps to improve the accuracy of inferring the pronunciation of words that are unseen in training. The reason for this is that the model does not have to learn the pronunciation of whole words or certain rules. Instead, it is able to learn the probabilities of subsequences of graphemes that generate particular phonemes for a certain context of surrounding graphemes and phonemes. Therefore, it has to know explicitly which phonemes are generated by which graphemes.

In the following, we illustrate the importance of introducing an alignment by means of an example. Figure 3.1 depicts two grapheme-phoneme pairs appearing in the training of the parameters of a model that does not take into account any alignment at all. Since the system has no way of learning which phonemes are produced by which graphemes, it is not able to learn anything about the pronunciation of the grapheme subsequence "ow", which is common to both words. Instead, if we introduce an alignment, we are able to make the same observations during training including their alignment, as shown in Figure 3.2. Hence, the model is able to learn that the grapheme subsequence "ow" generates the phoneme [5]. After being fully trained, the model can be used to infer the pronunciation of an unknown word; for instance, it would be able to deduce the pronunciation of the word "knowing" and, in particular, of its subsequence "ow", as presented in Figure 3.3.


Figure 3.1. Example of unaligned grapheme-phoneme pairs: "lowly" [l5lI] and "elbow" [Elb5].

Figure 3.2. Example of possible alignments for the grapheme-phoneme pairs presented in Figure 3.1: "lowly" aligned as l→[l], ow→[5], l→[l], y→[I], and "elbow" aligned as e→[E], l→[l], b→[b], ow→[5].

Most corpora do not contain any alignments, because this sort of manual annotation of thousands of words is costly and time consuming; hence the question of how to introduce such an alignment remains open. Within this thesis, three different approaches for introducing an alignment into a training procedure that uses an unannotated corpus are going to be presented. The first consists of applying an external model to precompute an alignment before the actual training of the CRF model parameters, during which this external alignment is kept fixed. This is the common procedure for introducing an alignment, not only in G2P conversion but also in translation. It will be further investigated in Section 3.5.

Figure 3.3. After the training of its parameters on a corpus that includes the grapheme-phoneme sequence pairs in Figure 3.2, a model can be used to infer the pronunciation of an unknown word. For instance, it would be able to deduce the pronunciation of "knowing" [n5IN], aligned as kn→[n], ow→[5], i→[I], ng→[N], and, in particular, of the subsequence "ow".

As convenient as it sounds, this approach still has some inherent disadvantages. To begin with, it implies a dependency on some external model that has to be used for precomputation, i.e. without this external tool no alignment can be generated for the CRF model. Consequently, it is necessary to have two different models, which is rather unattractive, particularly with regard to a compact solution. Moreover, the external alignment is not optimized for our specific CRF model but for the external model. As a result, it can be the case that the precomputed alignment is the optimal one for the external model, even though there might be several far better alignments for our CRF model. Apart from that, the external alignment always contains some errors, since it has not been created manually but estimated using some form of statistical model. Hence, it introduces errors into our CRF model training from the very beginning. One of the main goals of this thesis is to avoid these disadvantages. Therefore, two different approaches for CRFs with integrated alignments are going to be presented in the rest of this chapter.

On the one hand, resegmentation can be used to precompute an alignment: after beginning with some initial alignment, which is kept fixed, and a model containing only a small set of feature functions, the model that results after some iterations is used to recompute the alignment, and afterwards the model parameters are reset. The same is repeated a certain number of times, each time with the newly computed alignment remaining fixed, until the error rate on the development corpus, and with it the alignment quality, converges. Subsequently, the resulting alignment is used as a fixed alignment for the regular CRF model training following the maximum approach (see Subsection 2.1.1). Although this resegmentation approach eliminates the dependency on an external model and is optimized for the CRF model, it still implies an error propagation. In Section 3.6, this will be presented in more detail.

In order to eliminate the error propagation, one may not use a precomputed alignment at all. The solution presented within this thesis drops the approximation of using the most probable alignment and sums over all matching alignments instead. Thereby, the CRF model is independent of any external model and contains no error propagation. Because the summation is carried out over all matching alignments, the optimization of the alignment for CRFs is implied. This approach is the focus of this thesis and will be discussed in Section 3.7.

3.2 Alignment Constraints

We define the alignment $a_1^J = a_1 \ldots a_j \ldots a_J$ as mentioned in Subsection 2.1.1, namely as a function of the source position $j$ with $a_j = i$ if and only if the phoneme $e_i$ is generated by the grapheme $f_j$. Supposing we did not constrain the valid alignments, their number would grow exponentially in the length of the grapheme sequence: for a grapheme sequence $f_1^J$ and a phoneme sequence $e_1^I$, we would have $I$ possibilities for each $f_j$ to be aligned to (and $I + 1$ if we allowed a grapheme to remain unaligned), which results in $I^J$ alignment possibilities. Obviously, this is a far too large search space, not only for the maximum alignment approach, but for the summation


approach as well. For instance, an English word of average length, i.e. $J = 8$ and $I = 7$, would have $I^J \approx 5.7 \cdot 10^6$ alignments. In order to solve this problem and obtain a much smaller number of possible alignments, we restrict the alignments in such a way that we exclude those which are thought not to be valid in the sense of a pronunciation relationship, i.e. we adapt our alignments to real linguistic pronunciation relationships. Needless to say, pronunciation relationships between graphemes and phonemes vary greatly from language to language. Nevertheless, since most available corpora for training and testing a G2P conversion tool are English or, much less frequently, French or German, we impose the constraints in accordance with the pronunciation relationships in English and European languages in general.

3.2.1 Monotonicity Constraint

The first constraint to be introduced is the monotonicity constraint, which states that each grapheme $f_j$ in the grapheme sequence $f_1^J$ has to be aligned either to the phoneme that its predecessor $f_{j-1}$ has been aligned to, or to a phoneme at a later position in the phoneme sequence than the previously aligned phoneme. In other words, the aligned phonemes have to be in the same order as the graphemes they are aligned to:

$$\forall j: a_{j-1} \leq a_j \quad (3.1)$$

3.2.2 Representation Constraint

Furthermore, we add the representation constraint, which requires that each phoneme has to be aligned to at least one grapheme, i.e. for each phoneme there has to be at least one grapheme by which it has been generated:

$$\forall i\, \exists j: a_j = i \quad (3.2)$$

3.2.3 Mapping Constraint

Finally, the mapping constraint is imposed. Because we have defined the alignment $a_j$ as a function of the source position $j$, this constraint follows directly from the definition of the alignment given above. It implies that each grapheme can be aligned to at most one phoneme:

$$\forall j, i, i': i \neq i' \Rightarrow a_j \neq i \lor a_j \neq i' \quad (3.3)$$

3.3 M-to-1 Alignment

As a result of the combination of all three alignment constraints, we obtain a many-to-one (M-to-1) alignment between graphemes and phonemes, meaning that each phoneme is aligned to at least one grapheme without any alignment links crossing each other. Hence, each grapheme $f_j$ is aligned either to the same phoneme as its predecessor, i.e. $e_{a_{j-1}}$, or to the successor of this phoneme, i.e. $e_{a_{j-1}+1}$. Thus, the following holds:

$$\forall j: 0 \leq a_j - a_{j-1} \leq 1. \quad (3.4)$$

Figure 3.4 illustrates the general structure of the M-to-1 alignment with 1-to-0 and 1-to-1 alignment links, which corresponds to a (0,1) Hidden Markov Model (HMM) and results from the alignment step size $a_j - a_{j-1}$ being either zero or one, as presented above. In the graph, each node at a source position $j$ and target position $i$ represents the grapheme $f_j$ being aligned to the phoneme $e_i$, i.e. $a_j = i$. The automaton on the left side of Figure 3.4 illustrates that at each source position $j$, the grapheme can be aligned either to $e_{a_{j-1}}$ or to $e_{a_{j-1}+1}$.


Figure 3.4. General structure of a monotonic M-to-1 alignment with 1-to-0 and 1-to-1 alignment links. It corresponds to a (0,1) Hidden Markov Model (HMM). A node at position $(j, i)$ represents that the grapheme $f_j$ is aligned to the phoneme $e_i$, i.e. $a_j = i$. The automaton on the left side illustrates the possible alignment steps: at each source position $j$, the grapheme $f_j$ can be aligned either to $e_{a_{j-1}}$ or to $e_{a_{j-1}+1}$.
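The effect of the three constraints can be made concrete by enumerating valid alignments for small toy lengths. Under monotonicity, representation and mapping, the step size $a_j - a_{j-1}$ is automatically 0 or 1 (Eq. 3.4), and the number of valid alignments equals the number of ways to cut $f_1^J$ into $I$ non-empty segments, i.e. $\binom{J-1}{I-1}$.

```python
import itertools
from math import comb

# Enumerate the alignments a_1^J that satisfy all three constraints.
def valid_alignments(J, I):
    out = []
    for a in itertools.product(range(1, I + 1), repeat=J):  # mapping: a is a function of j
        monotone = all(a[j - 1] <= a[j] for j in range(1, J))
        surjective = set(a) == set(range(1, I + 1))         # representation
        if monotone and surjective:
            out.append(a)
    return out

A = valid_alignments(5, 3)   # J = 5 graphemes, I = 3 phonemes
```

For $J = 5$, $I = 3$ this yields $\binom{4}{2} = 6$ valid alignments, against $3^5 = 243$ unconstrained functions $a_1^J$.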

Due to the fact that our CRF software was originally implemented for a tagging task, it is only able to handle 1-to-1 alignments. In order to allow for its application together with an M-to-1 alignment, each target sequence is extended by named epsilons in the following way: if multiple graphemes f_j^{j+m} are aligned to one single phoneme e_i, then m named epsilons ε_{e_i} are inserted after e_i, and each grapheme f_{j+1}, . . . , f_{j+m} is aligned to one of them in consecutive order. An example is given in Figure 3.5. This comes at a price, since the size of the output alphabet is doubled, i.e. there are basically twice as many phonemes, which makes the whole conversion more difficult.
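This expansion of an M-to-1 alignment into a 1-to-1 one can be sketched as follows; the list-of-pairs encoding and the "eps_" prefix for named epsilons are illustrative assumptions, not the actual software's representation:

```python
def expand_with_named_epsilons(segments):
    """Turn an M-to-1 alignment into a 1-to-1 one by inserting named epsilons.

    segments is a list of (graphemes, phoneme) pairs; for a phoneme aligned
    to m+1 graphemes, m named epsilons "eps_<phoneme>" are appended so that
    every grapheme carries exactly one output label.
    """
    out = []
    for graphemes, phoneme in segments:
        out.append((graphemes[0], phoneme))       # first grapheme keeps the phoneme
        for g in graphemes[1:]:
            out.append((g, "eps_" + phoneme))     # named epsilon for this phoneme
    return out

# w[w] ay[1] l[l] ay[1]  ->  w[w] a[1] y[eps_1] l[l] a[1] y[eps_1]
print(expand_with_named_epsilons([("w", "w"), ("ay", "1"), ("l", "l"), ("ay", "1")]))
```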

w[w] ay[1] l[l] ay[1] = w[w] a[1] y[ε1] l[l] a[1] y[ε1]

Figure 3.5. The CRF software is able to handle only 1-to-1 alignments. Hence, each phoneme sequence that is shorter than its corresponding grapheme sequence has to be extended. Named epsilons are inserted in cases where multiple graphemes are aligned to one phoneme, such that a 1-to-1 alignment is obtained.

Compared to the alignment constraints that have been introduced in [Jiampojamarn & Kondrak 10], the main difference is that our model is not able to handle an optional existence of empty graphemes on the source side. In other words, it simply processes the input sequences as they are and does not insert epsilons on its own. Nevertheless, it allows one phoneme to be aligned to a subsequence of consecutive graphemes, as described above. Introducing epsilons on the target side, i.e. empty phonemes, as is done in [Jiampojamarn & Kondrak 10], may be seen as a special case of our M-to-1 alignment in which all named epsilons ε_{e_i} are equal, i.e. there is just one single ε.

3.3.1 Extended Grapheme Sequences

At this stage, we do not include empty graphemes on the source side in our implementation of the model, i.e. we stick to M-to-1 alignments. Later, we are going to extend our model, as shown in Section 3.4. But since there are graphemes that generate more than one phoneme, there are sequence pairs in which the phoneme sequence is longer than the grapheme sequence, i.e. J < I. An example of a possible alignment for such a case - including 1-to-2 links not contained in our model as described up to now - is shown in Figure 3.6.

“expiry”[Iksp2@rI] = e[I] x[ks] p[p] i[2@] r[r] y[I].

Figure 3.6. Example of a possible alignment including 1-to-2 links for a sequence pair with fewer graphemes than phonemes, i.e. J < I.

In order to keep our model applicable without changing it, we have to extend these grapheme sequences such that both the source and target sequence have the same length and, therefore, each phoneme is aligned to exactly one grapheme.


When applying resegmentation or summation over alignments, we add as many epsilons as necessary to the source sequence in a preprocessing step, so that we obtain J = I. The epsilons are distributed in such a way that the average distance between any two epsilons within the sequence, as well as the distance between the first epsilon and the beginning of the sequence, is maximized. Since these epsilons are inserted before the actual training of the CRF model, i.e. the CRF model has to process only sequence pairs with J ≥ I, it can simply be trained using M-to-1 alignments. For the previous example, we obtain the alignment shown in Figure 3.7. Obviously, this method yields suboptimal alignments, since the model learns, for example, that “y” is pronounced as [r].
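The uniform distribution of the inserted epsilons can be sketched as follows. This is a simplified placement rule that reproduces the “expiry” example of Figure 3.7; the symbol <eps> and the exact spreading scheme are assumptions, not necessarily the thesis implementation:

```python
def insert_uniform_epsilons(graphemes, I, eps="<eps>"):
    """Extend a grapheme sequence with I - J epsilons, spread as evenly as
    possible, so that source and target sequences have equal length."""
    J = len(graphemes)
    n = I - J
    if n <= 0:
        return list(graphemes)
    out, inserted = [], 0
    for j, g in enumerate(graphemes, start=1):
        out.append(g)
        # emit the k-th epsilon once a fraction k/n of the graphemes is emitted
        while inserted < n and j * n >= (inserted + 1) * J:
            out.append(eps)
            inserted += 1
    return out

# "expiry" (J = 6) vs. [Iksp2@rI] (I = 8): two epsilons, after "p" and after "y"
print(insert_uniform_epsilons(list("expiry"), 8))
# -> ['e', 'x', 'p', '<eps>', 'i', 'r', 'y', '<eps>']
```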

“expiry”[Iksp2@rI] = e[I] x[k] p[s] ε[p] i[2] r[@] y[r] ε[I].

Figure 3.7. Example of a sequence pair with J < I. Uniformly distributed ε’s are inserted such that both sequences have the same length, i.e. J = I, resulting in a 1-to-1 alignment.

Although there is nothing to be said against this method for extending particular grapheme sequences in the training corpus, strictly speaking it may not be applied to extend the grapheme sequences in the development and test corpora, since it requires the target sequence length I to be known - which it is not. In Section 3.4, an extension of the CRF model is presented such that no empty graphemes have to be inserted at all, i.e. I is no longer assumed to be known in the evaluation of the model. Since one of the objectives of this thesis is to present a summary of the whole work, thus including the step-by-step development of our model implementation, we will investigate aspects of this model further. Nevertheless, one should keep in mind that it is not fully applicable in this form, i.e. the model requires a further extension such that it is able to handle sequence pairs with J < I.

If we use an external model for the precomputation of the most probable alignment, the procedure varies from case to case. In the case of the joint n-gram G2P model presented in [Bisani & Ney 08], which is the external model we are going to use within the scope of this thesis, its model parameters have to be trained first. Afterwards, the trained model is applied to compute the most probable phoneme sequences for the training, development and test corpora. Because the configuration we use implies a 1-to-1 alignment model, epsilons are inserted on both the source and the target side. We rewrite these source sequences by replacing the epsilons with named epsilons, depending on the previous graphemes. Finally, we replace the original source sides of our corpora by these new ones, which contain named epsilons and have a length J ≥ I. Note that the extension of the source sides of the development and test corpora takes place without any information about their target sides, i.e. the references, which are unknown during training. Figure 3.8 depicts the alignment for the example presented above.

“expiry”[Iksp2@rI] = e[I] x[k] εx[s] p[p] i[2] εi[@] r[r] y[I].

Figure 3.8. Example of an alignment for a sequence pair with J < I that has been computed with the joint n-gram model of [Bisani & Ney 08].

3.3.2 Examples

In cases where the source and target sequences have the same length, i.e. J = I, there is exactly one possible alignment, as shown in Figure 3.9.

“sculptor”[skVlpt@R] = s[s] c[k] u[V] l[l] p[p] t[t] o[@] r[R].

Figure 3.9. Example of the 1-to-1 alignment for a sequence pair with J = I.

As mentioned above, if the source side is shorter than the target side, i.e. J < I, it is extended by supplementary epsilons when using resegmentation or summation over alignments. Thus, there is just one possible alignment. When using the joint n-gram model to precompute an external alignment, named epsilons are inserted at the most likely positions. Both cases are shown in Figure 3.10.

“elixir”[IlIks@R] = e[I] l[l] i[I] x[k] i[s] r[@] ε[R]
                  = e[I] l[l] i[I] x[k] εx[s] i[@] r[R]

Figure 3.10. Example of 1-to-1 alignments for a sequence pair with J < I. The first one has been obtained after the insertion of equally distributed epsilons. The second one has been computed using the joint n-gram model of [Bisani & Ney 08].

On the other hand, if the target side is shorter than the source side, i.e. J > I, there are several possible alignments that satisfy the imposed constraints. An example is given in Figure 3.11.


“realm”[rElm] = re[r] a[E] l[l] m[m]
              = r[r] ea[E] l[l] m[m]
              = r[r] e[E] al[l] m[m]
              = r[r] e[E] a[l] lm[m]

Figure 3.11. Possible M-to-1 alignments for a grapheme-phoneme sequence pair with J > I.

Given a source sequence and the corresponding target sequence, the set of all possible alignments can be easily interpreted when illustrated as an automaton in which each transition corresponds to a particular alignment link between a grapheme f and a phoneme e, depicted as f:[e]. Hence, the alignments in Figure 3.11 are equal to the set of alignments represented by the automaton in Figure 3.12, which illustrates the implementation of all matching alignments for a grapheme-phoneme sequence pair as an FSA. As stated at the beginning of Section 3.3, the CRF software was designed for a tagging task and requires 1-to-1 alignments. Hence, named epsilons are inserted after each phoneme that is aligned to multiple graphemes. This becomes obvious in the FSA presented in Figure 3.12.

[FSA diagram with states 0-8 and transition labels r:[r], e:[E], a:[E], a:[l], l:[l], l:[m], m:[m], e:[εr], a:[εE], l:[εl], m:[εm].]

Figure 3.12. Representation of the grapheme-phoneme sequence pair and all matching M-to-1 alignments presented in Figure 3.11 as an FSA in the CRF implementation.


3.3.3 Search Space

After having presented the M-to-1 alignment, which results from the constraints imposed in Section 3.2, as well as some illustrating examples, the question arises as to what practical advantage is obtained by imposing these constraints, i.e. to which value the number of possible alignments - I^J in the unconstrained case - for a given grapheme sequence f_1^J and a phoneme sequence e_1^I has been reduced. In order to answer this question, we reformulate the problem as shown below.

Because the first grapheme has to be aligned to the first phoneme, i.e. a_1 = 1 for every alignment a_1^J, the first state of the alignment automaton does not have to be taken into account. The rest of the automaton can be mapped onto a board with J + 1 − I rows and I columns, the lower left field corresponding to the second node in the automaton and the upper right one corresponding to the final state of the automaton. Moreover, only transitions to the right or upward are allowed. Figure 3.13 illustrates the resulting board for the automaton in Figure 3.12.

e[r] → a[E] → l[l] → m[m]
 ↑      ↑      ↑      ↑
r[r] → e[E] → a[l] → l[m]

Figure 3.13. Reformulation of the alignment problem for the example presented in Figure 3.11. The set of possible alignments for the grapheme-phoneme sequence pair corresponds to all possible paths from the lower left to the upper right node of the board.

The total number of alignments is then equal to the number of paths from the lower left to the upper right field, which can be computed recursively in the following way: each field receives a token corresponding to the number of paths leading from this field to the upper right one. Thus, knowing the tokens of the right and upper successors of a field, one can compute the token of this field by simply summing up the tokens of its two successors. Obviously, all fields in the upper row and the right column receive a token of 1, since there is exactly one path leading to the upper right field. Consecutively, the tokens for the second, third, up to the last row are computed from right to left. Finally, one obtains the token for the lower left field, which corresponds to the number of possible alignments for f_1^J and e_1^I. For the example above, we obtain the result shown in Figure 3.14. This equates to Pascal's triangle, where the upper right field corresponds to the triangle's top. Finally, we obtain the following result:

Number of possible M-to-1 alignments = (J − 1 choose I − 1).    (3.5)


e[r]: 1 → a[E]: 1 → l[l]: 1 → m[m]: 1
  ↑        ↑         ↑         ↑
r[r]: 4 → e[E]: 3 → a[l]: 2 → l[m]: 1

Figure 3.14. Reformulation of the alignment problem for the example presented in Figure 3.11. Each node is annotated with the number of possible paths leading to the upper right node. All fields in the upper row and the right column receive a token of 1. The token of a field equals the sum of the tokens of its two successors. Thus, the token of the lower left field can be computed recursively.
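The token-passing recursion above can be sketched in a few lines and checked against the closed form of Equation (3.5). The function name and the board encoding are illustrative assumptions:

```python
from functools import lru_cache
from math import comb

def num_m_to_1_alignments(J: int, I: int) -> int:
    """Count the monotone M-to-1 alignments via the token recursion on a
    board with J + 1 - I rows and I columns, where only moves to the right
    or upward are allowed."""
    rows, cols = J + 1 - I, I

    @lru_cache(maxsize=None)
    def token(r, c):
        # number of paths from field (r, c) to the upper right field
        if r == rows - 1 and c == cols - 1:
            return 1
        total = 0
        if r + 1 < rows:
            total += token(r + 1, c)   # upward move
        if c + 1 < cols:
            total += token(r, c + 1)   # move to the right
        return total

    return token(0, 0)

# "realm" -> [rElm]: J = 5, I = 4 gives (4 choose 3) = 4 alignments
print(num_m_to_1_alignments(5, 4), comb(5 - 1, 4 - 1))  # 4 4
```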

3.4 1-to-2 Alignment Links

As we have already illustrated in Figure 3.6, a pure M-to-1 alignment is not applicable without a further extension of grapheme sequences that are shorter than their corresponding phoneme sequences. In order to achieve this, we have described two different although similar methods: the insertion of uniformly distributed epsilons when using the resegmentation or summation approach, as well as of named epsilons at the most likely positions when precomputing an alignment using the joint n-gram model of [Bisani & Ney 08].

The joint n-gram aligner has the clear disadvantage of introducing a dependency on an external model, whereas the uniform distribution has the even worse disadvantage - as shown in Figure 3.7 - of learning wrong dependencies between graphemes and phonemes in most cases where J < I. In addition, when applying the fully trained model in the search to compute the most probable phoneme sequence for a given grapheme sequence, the model will fail in all cases in which it should produce a phoneme sequence longer than the grapheme sequence.

In order to avoid wrong dependencies in both the resegmentation and the summation approach and to allow for the generation of phoneme sequences longer than the grapheme sequence in the search, a grapheme has to be able to be linked to two phonemes where necessary. For this reason, we extend the alignment as follows:

a_j = { (i),         if e_i is generated by f_j,
      { (i, i + 1),  if e_i and e_{i+1} are generated by f_j.    (3.6)

Including the new definition of the alignment a_1^J, Figure 3.15 illustrates the general structure of the M-to-1 alignment with 1-to-0, 1-to-1 and 1-to-2 alignment links. For the sake of simplicity, we have chosen this depiction, which corresponds to a (0,1,2) HMM, although one has to keep in mind that a grapheme f_{j+1} that is aligned to the phoneme e_{i+2}, in the case that its predecessor f_j is aligned to the phoneme e_i, is also implicitly aligned to e_{i+1}.


[Figure: alignment grid over source positions 1-6 (horizontal axis) and target positions (vertical axis), with an automaton on the left.]

Figure 3.15. General structure of a monotonic M-to-1 alignment with 1-to-0, 1-to-1 and 1-to-2 alignment links corresponding to a (0,1,2) Hidden Markov Model (HMM). A node at the position (j, i) represents that the grapheme f_j is aligned to the phoneme e_i. The automaton on the left side illustrates the possible alignment steps. Note that a grapheme f_{j+1} that is aligned to the phoneme e_{i+2}, in the case that its predecessor f_j is aligned to the phoneme e_i, is also implicitly aligned to e_{i+1}.

Figure 3.16 illustrates an alignment that includes 1-to-2 alignment links for a grapheme-phoneme sequence pair with J < I. In order to introduce 1-to-2 alignment links into our CRF software, which requires a 1-to-1 alignment, we have developed the software further by allowing for the insertion of an optional named epsilon after each grapheme in the source sequence. For the example in Figure 3.16, one named epsilon has to be inserted to obtain a 1-to-1 alignment.

“box”[bQks] = b[b] o[Q] x[ks]

Figure 3.16. Example of a possible alignment including 1-to-2 links for a grapheme-phoneme sequence pair with J < I.

Because we also allow for optional named epsilons on the target side, such that we are able to model M-to-1 alignments with our CRF software, even two or more named epsilons may be inserted on the source side of the example in Figure 3.16, if the corresponding number of named epsilons is inserted on the target side, too. Nevertheless, we allow only one single named epsilon after each grapheme, i.e. we presume that 2 · J ≥ I. Figure 3.17 illustrates the FSA that corresponds to the grapheme-phoneme sequence pair and all its matching alignments (including 1-to-2 links) in Figure 3.16.

[FSA diagram with states 0-22 and transition labels such as b:[b], o:[Q], x:[k], x:[s], εb:[Q], o:[εb], εo:[Q], εo:[k], o:[εQ], εo:[s], x:[εs], εx:[s], x:[εk].]

Figure 3.17. Representation of the grapheme-phoneme sequence pair and all matching M-to-1 alignments including 1-to-2 links presented in Figure 3.16 as an FSA in the CRF implementation.

The 1-to-2 alignment links in our CRF software can easily be extended to 1-to-n alignment links with n ≥ 3. The introduction of 1-to-2 alignment links has approximately doubled the computation time of the CRF software, because it implies twice as many nodes in each FSA as without 1-to-2 links. Since the inclusion of 1-to-n links implies n times more nodes than the implementation including only 1-to-1 alignment links, we presume that the computation time is linear in n, where n is the maximum number of phonemes that can be aligned to one single grapheme.

3.5 External Alignment

The first approach for introducing an alignment into the CRF training to be presented in detail is the external alignment approach, which provides an alignment a_1^J before the actual CRF training using an external model. In general, the alignment computed using this external model is not optimized for the CRF model but for the external model itself. Within the scope of this thesis, we use the G2P joint n-gram model introduced in [Bisani & Ney 08], since in preliminary experiments it has proven itself the most promising one. Once the external alignment has been computed, we obtain the CRF model p_Λ(e_1^I, a_1^J | f_1^J) according to the maximum approach described in Subsection 2.1.1:

Pr(e_1^I | f_1^J) = ∑_{a_1^J} Pr(e_1^I, a_1^J | f_1^J)    (3.7)
                  = p_Λ(e_1^I, a_1^J | f_1^J), with external alignment a_1^J    (3.8)
                  = (1/Z) exp( ∑_{i=1}^{I} ∑_{k=1}^{K} λ_k h_k(e_{i−1}, e_i, a_1^J, f_1^J) ),    (3.9)

Z = ∑_{e_1^I} ∑_{a_1^J} exp( ∑_{i=1}^{I} ∑_{k=1}^{K} λ_k h_k(e_{i−1}, e_i, a_1^J, f_1^J) ).    (3.10)

During the training of the CRF model, its parameters Λ = {λ_1^K} are optimized as described in Section 2.3 - Figure 3.18 illustrates the training procedure.

[Diagram: the train corpus (e_1^I, f_1^J) is passed to the external alignment model, which yields the alignment a_1^J; the parameters Λ of the CRF model p_Λ(e_1^I, a_1^J | f_1^J) are trained, and the resulting model is used in the search on the dev/test corpora.]

Figure 3.18. CRF training using an external alignment.


3.6 Resegmentation

In order to eliminate the dependency on the external model, as well as to introduce an alignment that is optimized for the CRF model, resegmentation can be applied to compute the most probable alignment. Firstly, a linear segmentation, i.e. an equal distribution of alignment links, is computed. Obviously, this does not require any additional information - an example is given in Figure 3.19. As can easily be seen in this example, the linear segmentation is suboptimal, since subsequences of graphemes that produce one single phoneme are separated such that the individual graphemes are aligned to different phonemes. For instance, “o” and “r” should be aligned to [$] and “t” and “h” to [D].
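The linear segmentation, i.e. the equal distribution of alignment links, can be sketched as follows; the function name and the rounding-based boundary rule are illustrative assumptions that reproduce the “northern” example:

```python
def linear_segmentation(graphemes, phonemes):
    """Distribute the graphemes over the phonemes as evenly as possible,
    yielding the initial equal distribution of alignment links."""
    J, I = len(graphemes), len(phonemes)
    segments, start = [], 0
    for i, p in enumerate(phonemes, start=1):
        end = round(i * J / I)                   # boundary of the i-th segment
        segments.append(("".join(graphemes[start:end]), p))
        start = end
    return segments

# "northern" (J = 8) -> [n $ D H] (I = 4): no[n] rt[$] he[D] rn[H]
print(linear_segmentation(list("northern"), ["n", "$", "D", "H"]))
# -> [('no', 'n'), ('rt', '$'), ('he', 'D'), ('rn', 'H')]
```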

“northern”[n$DH] = no[n] rt[$] he[D] rn[H].

Figure 3.19. Example of a linear segmentation.

Next, the resegmentation, initialized with the linear segmentation, is processed. Being similar to the EM algorithm, it consists of a maximization step (M step) and an estimation step (E step). During the M step, the model parameters Λ′ = {λ′_1^K} of a reduced model, which has only rudimentary features, are optimized for a certain number of iterations according to the maximum approach in Subsection 2.1.1. After the i-th iteration, with i ∈ {5, 10, 15, 25, 35, 45, 65, 85, 105}, the M step is followed by the E step, during which the current CRF model p_Λ′ is used to resegment the training corpus. Because of the reduced feature set, the trained model includes only little information about the alignment, which allows the resegmentation to differ from the previous alignment. Afterwards, the feature weights Λ′ are reset to zero, such that no implied information about the previous alignment is kept; otherwise, the next resegmentation would be hardly noticeable. Moreover, the old segmentation is replaced by the new one and the next M step takes place. Note that the iteration numbers for the E step have been determined empirically.

As soon as the alignment quality converges after 105 iterations, which also implies the convergence of the error rate, the resegmentation is completed. Hence, the last segmentation a_1^J, resulting from the E step in the 105th iteration, is used as a fixed alignment for the actual training of the full CRF model p_Λ(e_1^I, a_1^J | f_1^J). The full model consists of the full feature set, which will be described in the next chapter, and is trained over 50 iterations according to the maximum approach described in Subsection 2.1.1. Again, the parameters Λ are optimized as described in Section 2.3. The whole training procedure of the resegmentation approach is depicted in Figure 3.20.


[Diagram: a linear segmentation of the train corpus provides the initial alignment a′_1^J; the parameters Λ′ of the reduced CRF model p_Λ′(e_1^I, a′_1^J | f_1^J) are trained and used to resegment the train corpus; after convergence, the parameters Λ of the full CRF model p_Λ(e_1^I, a_1^J | f_1^J) are trained on the final alignment a_1^J, and the resulting model is used in the search on the dev/test corpora.]

Figure 3.20. CRF training using resegmentation.

3.7 Summation over Alignments

Within the scope of this thesis, the main focus is on this last approach of summing over all matching alignments between a grapheme sequence f_1^J and a phoneme sequence e_1^I. There is no preprocessing step in which a single best alignment is computed; instead, the summation is processed within the CRF model training (see Section 2.3) according to the summation approach in Subsection 2.1.2:

Pr(e_1^I | f_1^J) = ∑_{a_1^J} Pr(e_1^I, a_1^J | f_1^J)    (3.11)
                  = ∑_{a_1^J} p_Λ(e_1^I, a_1^J | f_1^J)    (3.12)
                  = (1/Z) ∑_{a_1^J} exp( ∑_{i=1}^{I} ∑_{k=1}^{K} λ_k h_k(e_{i−1}, e_i, a_1^J, f_1^J) ),    (3.13)

Z = ∑_{e_1^I} ∑_{a_1^J} exp( ∑_{i=1}^{I} ∑_{k=1}^{K} λ_k h_k(e_{i−1}, e_i, a_1^J, f_1^J) ).    (3.14)

The main advantage in comparison to the resegmentation approach is that, since no fixed alignment is computed beforehand, there is no error propagation. Furthermore, the additional computational costs are insignificant, as shown in Subsection 3.3.3. In Figure 3.11, the possible M-to-1 alignments for an example with J > I have been shown, as well as the corresponding implementation of all matching alignments as an automaton in Figure 3.12. An overview of this quite basic training procedure is given in Figure 3.21.
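The sum over all monotone M-to-1 alignments does not have to be enumerated explicitly: it can be accumulated by a forward recursion over the board of Subsection 3.3.3. The following is a sketch of that dynamic program under the assumption of a precomputed matrix of non-negative per-link scores; it is not the actual CRF code:

```python
def sum_over_alignments(score):
    """Sum the product of per-link scores over all monotone M-to-1
    alignments via Q[j][i] = score[j][i] * (Q[j-1][i] + Q[j-1][i-1]).

    score is a J x I matrix; entry (j, i) scores the link f_{j+1} -> e_{i+1}.
    """
    J, I = len(score), len(score[0])
    Q = [[0.0] * I for _ in range(J)]
    Q[0][0] = score[0][0]                      # a_1 = 1 is fixed
    for j in range(1, J):
        for i in range(I):
            prev = Q[j - 1][i] + (Q[j - 1][i - 1] if i > 0 else 0.0)
            Q[j][i] = score[j][i] * prev       # alignment stepsize 0 or 1
    return Q[J - 1][I - 1]                     # a_J = I is fixed

# with all scores equal to 1, the sum simply counts the alignments:
uniform = [[1.0] * 4 for _ in range(5)]        # J = 5, I = 4 ("realm" example)
print(sum_over_alignments(uniform))            # 4.0 = (4 choose 3)
```

Since the recursion touches each of the J · I board fields once, the cost of the summation is negligible compared to enumerating all binom(J − 1, I − 1) alignments.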

[Diagram: the parameters Λ of the CRF model p_Λ(e_1^I, a_1^J | f_1^J) are trained directly on the train corpus (e_1^I, f_1^J), and the resulting model is used in the search on the dev/test corpora.]

Figure 3.21. CRF training using summation over alignments.


3.8 Convexity

Until now, we have not mentioned the convexity problem regarding the three different alignment approaches. In the first instance, the training of the CRF model parameters is convex, i.e. it is guaranteed to reach a global optimum for a fixed alignment, see [Lafferty & McCallum+ 01]. As soon as one introduces the alignment as a latent variable in training, there is no longer any guarantee of reaching a global optimum, i.e. the optimization problem is not convex.

Consequently, since it implies a fixed alignment, the external alignment approach allows for a convex optimization of the feature weights Λ.

When using resegmentation, one has to distinguish between the resegmentation itself and the actual CRF model training. During the M steps of the resegmentation, convexity is given for the current alignment, which is kept fixed. Once the alignment is re-estimated during the E step, the optimization is not convex anymore, since the alignment has changed. Nevertheless, the next M steps themselves imply a convex optimization with respect to the new alignment. After the resegmentation, the alignment is kept fixed during the actual CRF model training, i.e. convexity is guaranteed with respect to this last alignment.

On the other hand, if the summation approach is used instead of the maximum alignment approach, the alignment is introduced as a latent variable. Thus, there is no fixed alignment. Although the training of the model parameters converges, there is no guarantee of reaching a global optimum, i.e. the optimization may get stuck in a local optimum. Because of this non-convex optimization, the question of whether one reaches the global optimum or just a local one depends on the initialization of the model parameters. This inherent problem of the summation approach will be addressed in the next section.

3.9 Initialization

Due to the fact that the summation approach guarantees no convex optimization of the model parameters, their initialization determines whether the optimization gets stuck in a local optimum or not. Therefore, different initialization methods have been tested.

Firstly, the initialization used in most experiments is the zero model, which means that every feature weight λ_k has been initialized with 0.

Secondly, with the expectation of improving results, the model obtained from an experiment according to the resegmentation approach has been used as the initial model for an experiment with summation over all matching alignments.

Moreover, we have used the probabilities of an IBM Model 1, which has been trained on the same training corpus, to initialize the corresponding feature weights in the CRF model. In this case, all other weights have been initialized with zero.

In addition, the feature weights have been initialized with the relative frequencies of their corresponding features in the training corpus.

This last initialization has been extended by initializing only particular features, for instance, only features depending on the current grapheme context and the output phoneme, or even only a grapheme at a certain position and the current phoneme output. In the next chapter, the different kinds of features that have been used will be presented, followed by the results obtained using different initializations and features in Chapter 5.


Chapter 4

Features

Based on the CRF model introduced in Chapter 2 and the alignments presented in Chapter 3, this chapter addresses the feature functions used in the CRF model. For the initial experiments, we have used the CRF model and the corresponding feature functions that had previously been applied in the tagging domain. In order to achieve a higher performance, we have extended this first model by implementing new feature functions sequentially, resulting in CRF Models 1 to 4. Each of these models is a further development of its predecessor, and detailed descriptions will be given in the following sections. Only CRF Model 5 follows a different approach, as will be described in Section 4.6. In Section 2.2, we introduced the CRF model for the summation approach as follows:

Pr(e_1^I | f_1^J) = ∑_{a_1^J} Pr(e_1^I, a_1^J | f_1^J)    (4.1)
                  = (1/Z) ∑_{a_1^J} exp( ∑_{i=1}^{I} ∑_{k=1}^{K} λ_k h_k(e_{i−1}, e_i, a_1^J, f_1^J) ).    (4.2)

In fact, the CRF implementation differs somewhat from the mathematical model, since inside the software, the sum over all source positions j is processed instead of the sum over all target positions i. This results from the fact that the software was initially designed to be applied in concept tagging. In this domain, source and target sequence have the same length, i.e. J = I, and there is a 1-to-1 alignment between them. Summing over all source positions j instead of target positions i has the advantage that the implementation of the model can easily be used to produce the most probable target sequence for a given source sequence, since the source sequence is known, in contrast to the target sequence.

Accordingly, the feature functions h_k depend on e_{a_j} and e_{a_{j−1}} instead of e_i and e_{i−1}, i.e. the features fire depending on the phonemes aligned to the current grapheme f_j. This results in a potentially higher number of features, since it might be the case that for some e ∈ V_e, there is no phoneme sequence e_1^I with e_i = e and e_{i−1} = e, hence there is no feature h_k(e, e, a_1^J, f_1^J); but if two graphemes f_j and f_{j−1} are aligned to the same phoneme e_{a_j} = e, there is such a feature. Nevertheless, the difference in the number of features is insignificant. Due to the fact that this change is also applied to Z, the normalization is preserved.

In short, replacing the sum over target positions by the sum over source positions has the advantage that the implementation of the CRF model can be applied in both the training and the search in a similar way.

Moreover, the feature functions are not conditioned on the whole alignment a_1^J in our implementation of the CRF model, but only on the previous two alignment steps (a_{j−1} − a_{j−2}) and (a_j − a_{j−1}), for which we use the following abbreviatory notation:

δj = (aj − aj−1). (4.3)

In the case of a pure M-to-1 alignment, it follows that δ_j ∈ {0, 1}. If additional 1-to-2 alignment links - as described in Section 3.4 - are used, the above definition of δ_j remains unchanged, since we introduce optional named epsilons in the implementation and impose that none of them is aligned to the same phoneme as its predecessor. Consequently, we will define the features for a pure M-to-1 alignment.

All features are generated before the first training iteration by parsing through all sequence pairs in the training corpus, i.e. the CRF model contains no features other than those observed in the training corpus. The feature generation itself is determined by a definition of feature types, so-called templates. During the feature generation, all examples in the training corpus are parsed, and for each position j and every template it is checked which feature, i.e. which entity of the corresponding template, occurs. If the occurring feature does not exist yet, it is created, which means that its template, its context C_k and its unique index k are saved. The type of the context depends on the template, e.g. lexical features have other kinds of contexts than transition features.
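The template-driven feature generation described above can be sketched as follows. The function names, the corpus encoding and the example template are illustrative assumptions, not the actual software:

```python
def generate_features(corpus, templates):
    """Generate the feature set by parsing the training corpus: for every
    position j and every template, the occurring context is extracted and,
    if unseen, stored together with a unique index k."""
    feature_index = {}
    for source, target, alignment in corpus:
        for j in range(len(source)):
            for name, template in templates:
                context = template(source, target, alignment, j)
                key = (name, context)
                if key not in feature_index:       # create feature on first occurrence
                    feature_index[key] = len(feature_index)
    return feature_index

# a hypothetical lexical template: context = (aligned phoneme, step, grapheme at j)
def lex0(source, target, alignment, j):
    step = alignment[j] - (alignment[j - 1] if j > 0 else 0)
    return (target[alignment[j] - 1], step, source[j])

corpus = [(list("no"), ["n"], [1, 1])]
print(generate_features(corpus, [("lex0", lex0)]))
```

Only contexts actually observed in the corpus receive an index, matching the statement that the model contains no features other than those seen in training.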

4.1 Tagging CRF Model

As a starting point, we apply the implementation of the CRF model that has been previously used in the tagging domain to the G2P conversion task. Below, we describe the different types of feature functions h_k, i.e. feature templates, in this implementation. Each of these binary functions fires, i.e. outputs 1, if and only if a certain context is given.

4.1.1 Single Lexical Feature

The first feature type to be defined is the single lexical feature, which models the dependency of a phoneme e_{a_j} and an alignment step δ_j on a single grapheme inside a certain source context window f_{j−δlex}^{j+δlex}, i.e. a grapheme f_{j+d} with d ∈ {−δlex, …, +δlex}. It is defined as

h_k : V_e × {0, 1} × V_f → {0, 1}

(e_{a_j}, δ_j, f_{j+d}) ↦ { 1, if e_{a_j} = e, δ_j = δ, f_{j+d} = f
                            0, otherwise }    (4.4)

for the context C_k = (e, δ, f) ∈ V_e × {0, 1} × V_f. If not stated otherwise, the lexical context window is limited to δlex = 5. The setting of this parameter has been optimized empirically in preliminary experiments.
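As an illustration of Eq. (4.4), such a binary feature can be viewed as a closure over its stored context (e, δ, f). The function names below are hypothetical and only sketch the behaviour:

```python
# Sketch of the binary single lexical feature h_k of Eq. (4.4):
# it fires (returns 1) iff the aligned phoneme, the alignment step and
# the windowed grapheme all equal the stored context (e, delta, f).

def make_single_lexical_feature(e, delta, f):
    def h_k(e_aj, delta_j, f_jd):
        return 1 if (e_aj, delta_j, f_jd) == (e, delta, f) else 0
    return h_k

# Feature for the (illustrative) context e = "I", delta = 1, f = "e":
h = make_single_lexical_feature("I", 1, "e")
```

Here `h("I", 1, "e")` evaluates to 1, and any other argument triple to 0.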

4.1.2 Combined Lexical Feature

In addition, we define combined lexical features, which express the dependency of a phoneme e_{a_j} and an alignment step δ_j on a grapheme sequence f_{j+d1}^{j+d2}. The indexes d1 and d2 are limited in our experiments: we impose that all of the consecutive graphemes are within the lexical window described above, i.e. we introduce lower and upper boundaries −δlex and +δlex with −δlex ≤ d1 < d2 ≤ +δlex, and require that the length of the grapheme sequence does not exceed δcmb, i.e. d2 − d1 < δcmb. For given δlex and δcmb, we define all combined lexical features that match both criteria, e.g. for δlex = 5 and δcmb = 5, we define a combined lexical feature for each consecutive sequence of at least two and at most five graphemes that does not exceed the lexical window f_{j−5}^{j+5}. The combined lexical feature is defined as

h_k : V_e × {0, 1} × V_f^{d2−d1+1} → {0, 1}

(e_{a_j}, δ_j, f_{j+d1}^{j+d2}) ↦ { 1, if e_{a_j} = e, δ_j = δ, f_{j+d1}^{j+d2} = f′_1^{d2−d1+1}
                                    0, otherwise }    (4.5)

for the context C_k = (e, δ, f′_1^{d2−d1+1}, d1, d2) ∈ V_e × {0, 1} × V_f^{d2−d1+1} × Z². For δlex = 5, δcmb has been optimized empirically in preliminary experiments. Unless stated otherwise, we will use δcmb = 5.

4.1.3 Transition Feature

The third feature template, which is contained in the Tagging CRF Model, is the transition feature. It models the dependency of a phoneme e_{a_j} and an alignment step δ_j on their predecessors e_{a_{j−1}} and δ_{j−1} and is defined as

h_k : V_e × {0, 1} × V_e × {0, 1} → {0, 1}

(e_{a_j}, δ_j, e_{a_{j−1}}, δ_{j−1}) ↦ { 1, if e_{a_j} = e, δ_j = δ, e_{a_{j−1}} = e′, δ_{j−1} = δ′
                                         0, otherwise }    (4.6)

for the context C_k = (e, δ, e′, δ′) ∈ V_e × {0, 1} × V_e × {0, 1}.


4.1.4 Decoupling of Alignments and Target Sequences

As mentioned previously, since the CRF implementation has been originally designed for the application domain of concept tagging, it requires a 1-to-1 alignment. The solution that has been applied is to replace each output sequence e_1^I by a tag sequence t_1^J and to align each f_j to t_j, i.e. by inserting named epsilons on the target side as described in Section 3.3. Hence, a tag t_j that is aligned to the input f_j corresponds to both the aligned output e_{a_j} and the alignment step δ_j. Further details can be found in [Hahn & Lehnen+ 09].

In order to achieve a higher performance, we have developed the CRF software further, such that the target sequence and the alignment can be decoupled from each other. This allows for the definition of new feature functions, e.g. lexical features independent of the alignment step, or distinct bigram and alignment features instead of a common transition feature, which will be presented in the following sections.

4.2 CRF Model 1

Firstly, we implement new lexical features, which are independent of the alignment step δ_j, and keep the previous transition features unchanged (see Subsection 4.1.3). The lexical features included in CRF Model 1 are described below. Because CRF Models 2, 3 and 4 contain the same single and combined lexical features as CRF Model 1, we will present only the new implementations of the transition feature in the corresponding Sections 4.3, 4.4 and 4.5.

4.2.1 Single Lexical Feature

The single lexical feature models the dependency of a phoneme e_{a_j} on a single grapheme f_{j+d} with d ∈ {−δlex, …, +δlex}. It is defined as

h_k : V_e × V_f → {0, 1}

(e_{a_j}, f_{j+d}) ↦ { 1, if e_{a_j} = e, f_{j+d} = f
                       0, otherwise }    (4.7)

for the context C_k = (e, f) ∈ V_e × V_f. Unless stated otherwise, δlex is limited to 5, due to the optimization in preliminary experiments.

4.2.2 Combined Lexical Feature

The combined lexical feature expresses the dependency of a phoneme e_{a_j} on a grapheme sequence f_{j+d1}^{j+d2}. The indexes d1 and d2 are limited and, for given δlex and δcmb, all combined lexical features that match the criteria are defined, as presented for the Tagging CRF Model. The combined lexical feature is defined as

h_k : V_e × V_f^{d2−d1+1} → {0, 1}

(e_{a_j}, f_{j+d1}^{j+d2}) ↦ { 1, if e_{a_j} = e and f_{j+d1}^{j+d2} = f′_1^{d2−d1+1}
                               0, otherwise }    (4.8)

for the context C_k = (e, f′_1^{d2−d1+1}, d1, d2) ∈ V_e × V_f^{d2−d1+1} × Z². Unless stated otherwise, we have set δcmb to 5 in the experiments presented in the next chapter, for the same reasons as in Subsection 4.1.2.

4.3 CRF Model 2

Secondly, in place of the previous transition feature, we introduce two different ones. This allows for the independent modelling of the currently aligned phoneme e_{a_j} and the current alignment step δ_j.

4.3.1 Alignment Feature

The alignment feature models the dependency of an alignment step δ_j on the previously aligned phoneme e_{a_{j−1}} and the previous alignment step δ_{j−1}. It is defined as

h_k : {0, 1} × V_e × {0, 1} → {0, 1}

(δ_j, e_{a_{j−1}}, δ_{j−1}) ↦ { 1, if δ_j = δ, e_{a_{j−1}} = e, δ_{j−1} = δ′
                                0, otherwise }    (4.9)

for the context C_k = (δ, e, δ′) ∈ {0, 1} × V_e × {0, 1}.

4.3.2 Bigram Feature

In addition to the alignment features, CRF Model 2 contains bigram features, which model the dependency of a phoneme e_{a_j} on its predecessor e_{a_{j−1}} and the previous alignment step δ_{j−1}. The bigram feature is defined as

h_k : V_e² × {0, 1} → {0, 1}

(e_{a_j}, e_{a_{j−1}}, δ_{j−1}) ↦ { 1, if e_{a_j} = e, e_{a_{j−1}} = e′, δ_{j−1} = δ
                                    0, otherwise }    (4.10)

for the context C_k = (e, e′, δ) ∈ V_e² × {0, 1}.


4.4 CRF Model 3

Although we have two separate transition features in the previous model, both still depend on the previously aligned phoneme and the previous alignment step. In this model, the dependency of the alignment feature on the previously aligned phoneme and the dependency of the bigram feature on the previous alignment step are dropped. As a result, we obtain the features defined below.

4.4.1 Alignment Feature

Since the dependency on the previously aligned phoneme e_{a_{j−1}} is dropped, this feature models the dependency of an alignment step δ_j on the previous alignment step δ_{j−1} and is defined as

h_k : {0, 1}² → {0, 1}

(δ_j, δ_{j−1}) ↦ { 1, if δ_j = δ, δ_{j−1} = δ′
                   0, otherwise }    (4.11)

for the context C_k = (δ, δ′) ∈ {0, 1}². Note that there are only four possible alignment features.

4.4.2 Bigram Feature

The dependency of a phoneme e_{a_j} on its predecessor e_{a_{j−1}} is modelled by this bigram feature, which is defined as

h_k : V_e² → {0, 1}

(e_{a_j}, e_{a_{j−1}}) ↦ { 1, if e_{a_j} = e, e_{a_{j−1}} = e′
                           0, otherwise }    (4.12)

for the context C_k = (e, e′) ∈ V_e².

4.5 CRF Model 4

Compared to CRF Model 3, the lexical and alignment features remain unchanged, but the bigram feature is developed further, as shown below.

4.5.1 Bigram Feature

In order to develop the bigram feature of CRF Model 3 further, we add the additional condition that the current alignment step has to be non-zero, i.e. the current grapheme is not aligned to the same phoneme as its predecessor. This requirement is equivalent to a definition of a bigram feature dependent on phonemes instead of aligned phonemes. Hence, the bigram feature is defined as

h_k : V_e² → {0, 1}

(e_i, e_{i−1}) ↦ { 1, if e_i = e, e_{i−1} = e′
                   0, otherwise }    (4.13)

for the context C_k = (e, e′) ∈ V_e². Note that the combination of the lexical, alignment and bigram features in CRF Model 4 is similar to a discriminative HMM (see [Heigold & Wiesler+ 10]).

4.6 CRF Model 5

Finally, we have implemented an extension of CRF Model 3, which contains new lexical, alignment and bigram features in addition to those already contained in CRF Model 3. Below, we present only the new features.

4.6.1 Single Lexical Feature

This additional single lexical feature models the dependency of the current alignment step δ_j on a single grapheme f_{j+d} with d ∈ {−δlex, …, +δlex}. It is defined as

h_k : {0, 1} × V_f → {0, 1}

(δ_j, f_{j+d}) ↦ { 1, if δ_j = δ, f_{j+d} = f
                   0, otherwise }    (4.14)

for the context C_k = (δ, f) ∈ {0, 1} × V_f. Unless stated otherwise, δlex is limited as presented in Subsection 4.2.1.

4.6.2 Combined Lexical Feature

The dependency of the current alignment step δ_j on a grapheme sequence f_{j+d1}^{j+d2} is expressed by this additional combined lexical feature. The indexes d1 and d2 are limited as before and, for given δlex and δcmb, all combined lexical features that match the criteria are defined. The additional combined lexical feature is defined as

h_k : {0, 1} × V_f^{d2−d1+1} → {0, 1}

(δ_j, f_{j+d1}^{j+d2}) ↦ { 1, if δ_j = δ and f_{j+d1}^{j+d2} = f′_1^{d2−d1+1}
                           0, otherwise }    (4.15)

for the context C_k = (δ, f′_1^{d2−d1+1}, d1, d2) ∈ {0, 1} × V_f^{d2−d1+1} × Z². Unless stated otherwise, we have set δcmb to 5 as presented in Subsection 4.2.2.


4.6.3 Alignment Feature

The additional alignment feature models the dependency of an alignment step δ_j on the previously aligned phoneme e_{a_{j−1}}. It is defined as

h_k : {0, 1} × V_e → {0, 1}

(δ_j, e_{a_{j−1}}) ↦ { 1, if δ_j = δ, e_{a_{j−1}} = e
                       0, otherwise }    (4.16)

for the context C_k = (δ, e) ∈ {0, 1} × V_e.

4.6.4 Bigram Feature

The bigram feature models the dependency of a phoneme e_{a_j} on the previous alignment step δ_{j−1} and is defined as

h_k : V_e × {0, 1} → {0, 1}

(e_{a_j}, δ_{j−1}) ↦ { 1, if e_{a_j} = e, δ_{j−1} = δ
                       0, otherwise }    (4.17)

for the context C_k = (e, δ) ∈ V_e × {0, 1}.

4.7 Summary

In Table 4.1, we present an overview of the CRF models and their features, as described in the previous sections. For the sake of readability, the single lexical features are not listed, but their form is obvious since they correspond to a combined lexical feature with d1 = d2. The CRF models are listed in the same order as they were developed, and the parameters that are changed in order to obtain the next model are highlighted in colour.

In summary, we have dropped the dependency of the lexical features on the alignment step δ_j of our initial CRF model implemented for concept tagging, resulting in CRF Model 1. Moreover, we have decoupled δ_j and e_{a_j} in the transition feature of CRF Model 1, thus obtaining CRF Model 2 with two different transition features: an alignment and a bigram feature. CRF Model 3 results from dropping the dependency of the alignment feature on the previously aligned phoneme and the dependency of the bigram feature on the previous alignment step in CRF Model 2. The further restriction of the bigram feature in CRF Model 3, which is implied by requiring that the bigram fires only if a_j > a_{j−1}, i.e. if δ_j = 1, results in CRF Model 4. Finally, CRF Model 5 contains all features of CRF Model 3, as well as the additional features presented in Subsections 4.6.1 to 4.6.4.


Table 4.1. Overview of the distinct CRF models and their features, presented in the same order as they were developed. The parameters that need to be changed in order to obtain the next model are highlighted in colour. Single lexical features are not listed, but their form corresponds to a combined lexical feature with d1 = d2. For the Tagging CRF Model and CRF Model 1, the alignment and bigram features are combined in a single transition feature.

Model             | Combined lexical                | Alignment                            | Bigram
Tagging CRF Model | (e_{a_j}, δ_j, f_{j+d1}^{j+d2}) | (e_{a_j}, δ_j, e_{a_{j−1}}, δ_{j−1}) (transition) |
CRF Model 1       | (e_{a_j}, f_{j+d1}^{j+d2})      | (e_{a_j}, δ_j, e_{a_{j−1}}, δ_{j−1}) (transition) |
CRF Model 2       | (e_{a_j}, f_{j+d1}^{j+d2})      | (δ_j, e_{a_{j−1}}, δ_{j−1})          | (e_{a_j}, e_{a_{j−1}}, δ_{j−1})
CRF Model 3       | (e_{a_j}, f_{j+d1}^{j+d2})      | (δ_j, δ_{j−1})                       | (e_{a_j}, e_{a_{j−1}})
CRF Model 4       | (e_{a_j}, f_{j+d1}^{j+d2})      | (δ_j, δ_{j−1})                       | (e_i, e_{i−1})
CRF Model 5       | (e_{a_j}, f_{j+d1}^{j+d2})      | (δ_j, δ_{j−1})                       | (e_{a_j}, e_{a_{j−1}})
                  | (δ_j, f_{j+d1}^{j+d2})          | (δ_j, e_{a_{j−1}})                   | (e_{a_j}, δ_{j−1})


Chapter 5

Experiments and Results

The practical application of the CRF models and alignments introduced in the previous chapters is the subject of this chapter. After introducing the corpora that we have used to train and evaluate our models, we proceed with the presentation of the corresponding experiments and their discussion.

5.1 Performance Metrics

In G2P conversion, performance is measured by means of the phoneme error rate (PER) as well as the word error rate (WER). The former is defined as the Levenshtein distance between the conversion result (hypothesis) and the correct pronunciation (reference), divided by the length of the reference and expressed as a percentage. The Levenshtein distance itself is defined as the minimum number of operations, consisting of insertions, deletions and substitutions, that need to be applied to the hypothesis in order to obtain the reference [Levenshtein 66]. The WER is defined as the percentage of words with at least one phoneme error. Our baseline is given by the results obtained by [Bisani & Ney 08].
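The PER computation described above can be sketched as follows; the sequences in the test values are illustrative and not taken from the corpora used here.

```python
# Sketch of the PER metric: Levenshtein distance between hypothesis and
# reference phoneme sequences, summed over a corpus and divided by the
# total reference length.

def levenshtein(hyp, ref):
    """Minimum number of insertions, deletions and substitutions
    turning hyp into ref (standard dynamic programming, one row)."""
    d = list(range(len(ref) + 1))          # distances for empty hypothesis
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i               # prev = d[i-1][j-1]
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # skip symbol in hyp
                                   d[j - 1] + 1,      # skip symbol in ref
                                   prev + (h != r))   # (mis)match
    return d[len(ref)]

def per(hypotheses, references):
    """Phoneme error rate in percent over a list of sequence pairs."""
    errors = sum(levenshtein(h, r) for h, r in zip(hypotheses, references))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total
```

The WER would then simply count the fraction of pairs with `levenshtein(h, r) > 0`.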

5.2 Corpora

In order to train and evaluate the CRF models and alignments presented in the previous chapters, we have used the corpora shown in Table 5.1: Celex, a British English corpus, and NETtalk 15k, an American English corpus. We have chosen only English corpora because, compared to other languages such as French or German, the pronunciation of English is far more irregular. Therefore, the results obtained in G2P conversion on English corpora still have room for improvement, in contrast to French or German corpora, due to their very regular pronunciations. |V_f| and |V_e| denote the sizes of the grapheme and phoneme sets, respectively. Note that in our implementation, we have to double the size of the phoneme set due to the named epsilons, as described in Section 3.3. The corpora have an average grapheme sequence length of J and an average phoneme sequence length of I. In addition, we list the average number of pronunciations per word in the training set, as well as the sizes of the training, development and test sets.

Table 5.1. G2P corpora statistics.

              symbols        words       prons/   number of words
              |V_f|  |V_e|   J     I     word     train   dev    test
NETtalk 15k   26     50      7.3   6.2   1.01     13804   1071   4951
Celex         26     53      8.4   7.1   1.00     39995   5000   15000

The main difference between these two corpora is that NETtalk 15k contains so-called double phonemes in place of two phonemes that are related to one grapheme in terms of pronunciation. Therefore, each phoneme sequence in NETtalk 15k is shorter than or equally as long as its corresponding grapheme sequence. Moreover, there are diverse pronunciations for some words in the NETtalk 15k corpus. Since a trained model produces only one single best pronunciation for a given word, the pronunciation can only be equal to one of the diverse references, resulting in additional errors for hypotheses that are actually correct.

5.3 Features

As presented in Chapter 4, using the Tagging CRF Model as a starting point, different CRF models were developed for the G2P conversion task. Below, we compare them on both corpora: on the one hand, computing an external alignment using the joint n-gram model of [Bisani & Ney 08]; on the other hand, applying the summation over all matching alignments. All results listed below were obtained after a training with 50 iterations. The main purpose is to determine whether one of the models outperforms the others in terms of PER and WER, such that we can limit ourselves to this model in all following experiments.

When using the summation approach, all feature weights were initialized with zero. Moreover, grapheme sequences shorter than the corresponding phoneme sequences were extended to equal length by additional equally distributed epsilons. Although this requires the unknown length of the phoneme sequences from the development and test corpora, it is necessary in order to allow for an application of the model using an M-to-1 alignment. Further details concerning M-to-1 alignments can be found in Section 3.3. Nevertheless, the results obtained using the summation approach on Celex are only comparable with themselves and not with other scientific publications.

The features that are included in the various CRF models are summarized in Table 4.1.


Table 5.2. Evaluation of the CRF models on Celex. The training has been performed following the maximum approach with an external alignment computed using the joint n-gram model of [Bisani & Ney 08].

                          dev               test
                     PER[%]  WER[%]    PER[%]  WER[%]
[Bisani & Ney 08]                      2.5     11.4
Tagging CRF Model    2.6     12.7      2.5     12.2
CRF Model 1          2.6     12.8      2.6     12.5
CRF Model 2          2.8     13.3      2.7     12.8
CRF Model 3          3.0     14.8      2.8     13.8
CRF Model 4          2.8     13.5      2.6     12.8
CRF Model 5          2.7     13.2      2.5     12.2

5.3.1 Celex

Firstly, we begin with the evaluation of the models introduced in Chapter 4 on the Celex corpus, using the joint n-gram model to compute an external alignment; Table 5.2 lists the results obtained. As can easily be seen, the best results on the development corpus are obtained when using the Tagging CRF Model. Nevertheless, CRF Model 1 performs almost equally well, since the difference between them is statistically insignificant.

Secondly, we have carried out experiments on the same corpus with the same settings, except that for the experiments listed in Table 5.3, the summation over all matching alignments has been applied. In this case, CRF Model 1 outperforms all other models, including the Tagging CRF Model, on the development corpus. Nevertheless, CRF Model 4 also performs better than the Tagging CRF Model and nearly as well as CRF Model 1.

Since this work is focused on the summation approach, for which CRF Model 1 yields the best results on Celex, and due to the fact that CRF Model 1 performs nearly as well as the Tagging CRF Model when using an external alignment, CRF Model 1 is the favoured model for the Celex corpus.

5.3.2 NETtalk 15k

In addition, the same experiments as in the previous subsection were carried out on the NETtalk 15k corpus. Table 5.4 lists the results obtained when using an external alignment computed with the joint n-gram model of [Bisani & Ney 08]. CRF Model 2 yields the best results on the development corpus, whereas the results of CRF Models 1 and 5 are both comparable, since the difference is only 0.1% PER.


Table 5.3. Evaluation of CRF models on Celex following the summation approach. Note that the grapheme sequences of the development and test corpora that are shorter than the corresponding phoneme sequences have been extended to equal length by the insertion of equally distributed epsilons. Thus, these results are not fully comparable to those of [Bisani & Ney 08], since the length of the phoneme sequence to be inferred is presumed to be known.

                          dev               test
                     PER[%]  WER[%]    PER[%]  WER[%]
[Bisani & Ney 08]                      2.5     11.4
Tagging CRF Model    3.1     15.0      2.9     14.2
CRF Model 1          2.8     13.6      2.8     13.4
CRF Model 2          3.1     14.5      3.0     14.2
CRF Model 3          3.2     15.4      3.1     14.9
CRF Model 4          2.9     14.6      2.8     14.0
CRF Model 5          3.0     14.5      3.0     14.4

Table 5.4. Evaluation of the CRF models on NETtalk 15k. The training has been performed following the maximum approach with an external alignment computed using the joint n-gram model of [Bisani & Ney 08].

                          dev               test
                     PER[%]  WER[%]    PER[%]  WER[%]
[Bisani & Ney 08]                      8.3     33.7
Tagging CRF Model    7.5     33.7      8.1     34.7
CRF Model 1          7.4     33.6      8.0     34.1
CRF Model 2          7.3     33.0      8.1     34.5
CRF Model 3          7.5     33.4      8.2     34.8
CRF Model 4          8.0     34.5      8.5     36.1
CRF Model 5          7.4     32.6      8.0     34.3

Finally, the results for the different CRF models in combination with the summation approach are presented in Table 5.5. CRF Model 1 outperforms all other models. As stated above for the Celex corpus, CRF Model 1 performs best when applying the summation approach, on which this thesis is focused, and nearly as well as the next-best model when using an external alignment.

In summary, CRF Model 1 yields the best results for the summation over alignments approach. Moreover, it performs nearly as well as the best model when using an external alignment: on Celex there is no difference in PER and only a 0.1% difference in WER, and on NETtalk 15k the difference is 0.1% PER and 0.6% WER. As a result, we restricted ourselves to CRF Model 1 in all further experiments.

Table 5.5. Evaluation of CRF models on NETtalk 15k following the summation approach.

                          dev               test
                     PER[%]  WER[%]    PER[%]  WER[%]
[Bisani & Ney 08]                      8.3     33.7
Tagging CRF Model    7.8     34.8      8.3     35.6
CRF Model 1          7.5     33.9      8.1     34.7
CRF Model 2          7.8     34.4      8.1     34.7
CRF Model 3          7.8     33.9      8.2     35.2
CRF Model 4          7.9     34.0      8.2     35.2
CRF Model 5          7.8     34.5      8.2     35.3

5.4 Alignments

In order to compare the three different alignment approaches, experiments on both Celex and NETtalk 15k were carried out using CRF Model 1. The external alignment was computed by means of the joint n-gram model [Bisani & Ney 08]; the resegmentation experiment was carried out as described in Section 3.6. In the summation experiment, all feature weights were initialized with zero. Table 5.6 shows the results for the Celex corpus, as well as the WER published in [Jiampojamarn & Cherry+ 10].

So far, the summation and the resegmentation approaches (with epsilons included in the grapheme sequences of the development and test corpora) perform equally well, yet worse than the external alignment approach. Although the latter uses the joint n-gram model implemented by [Bisani & Ney 08] to compute a single best alignment, the results obtained are worse than those obtained in [Bisani & Ney 08], where the joint n-gram model is used for the G2P conversion itself. One reason may be that the external alignment implies an error propagation: the alignment itself has been computed using a statistical model, and it has been optimized for the joint n-gram model and not for CRF Model 1.

One has to keep in mind that the results obtained by the resegmentation and summation approaches (including epsilons in the grapheme sequences of the development and test corpora) are not quite comparable, since they imply knowledge about the lengths of the target sequences in both the development and the test corpus. The comparison between the results obtained using the summation approach with and without epsilons makes this apparent: without epsilons, the PER and WER are significantly increased


Table 5.6. Evaluation of the different alignment approaches on Celex. Only the results obtained without epsilons in the development and test corpora can be compared to [Jiampojamarn & Cherry+ 10] and [Bisani & Ney 08], since those including epsilons require information about the lengths of the phoneme sequences that have to be inferred. In order to improve the performance of the resegmentation and summation approaches and to be fully applicable to the G2P conversion task, the M-to-1 alignment model has to be extended by 1-to-2 alignment links.

                                       dev               test
                                  PER[%]  WER[%]    PER[%]  WER[%]
[Jiampojamarn & Cherry+ 10]                                 10.8
[Bisani & Ney 08]                                   2.5     11.4
external alignment                2.6     12.8      2.6     12.5
sum over alignments (with ε's)    2.8     13.6      2.8     13.4
resegmentation (with ε's)         3.0     14.9      2.9     14.1
sum over alignments (without ε's) 3.2     15.3      3.1     15.1
resegmentation (without ε's)      3.4     16.6      3.2     15.9

on both the development corpus (0.4% PER and 1.7% WER) and the test corpus (0.3% PER and 1.7% WER). For the resegmentation experiment without epsilons, we observe the same effect: on the development corpus the error rates rise by 0.2% PER and 1.3% WER, on the test corpus by 0.1% PER and 0.8% WER. In order to improve the results, the M-to-1 alignment model has to be extended by 1-to-2 alignment links, such that the CRF model becomes fully applicable to the G2P conversion task on all corpora.

In addition, experiments with the same settings as described above were carried out on the NETtalk 15k corpus; their results are presented in Table 5.7. Unfortunately, no results were published by [Jiampojamarn & Cherry+ 10] on this corpus.

Firstly, all three approaches outperform the joint n-gram model of [Bisani & Ney 08] in terms of PER on the test corpus, even though the WER is slightly worse, which means that even though there are fewer phoneme errors, these errors are distributed more evenly over the sequence pairs. Because the NETtalk 15k corpus makes use of double phonemes, i.e. it does not contain sequence pairs where the phoneme sequence is longer than the corresponding grapheme sequence, our M-to-1 alignment model is fully applicable: the source sequences do not have to be extended, hence no knowledge about the phoneme sequences in the development and test corpus is implied. Therefore, the results obtained are comparable and clearly outperform the baseline.

Secondly, the resegmentation and summation approaches yield nearly equal results, i.e. the summation approach performs slightly worse. Nevertheless, the external alignment outperforms both of them.

During the next experiments, we will try to improve the summation approach such


Table 5.7. Evaluation of the different alignment approaches on NETtalk 15k. No results have been published by [Jiampojamarn & Cherry+ 10] on this corpus. All three approaches outperform the joint n-gram model of [Bisani & Ney 08] in terms of PER on the test corpus. The M-to-1 alignment model is fully applicable, since the NETtalk 15k corpus includes double phonemes and thus does not contain sequence pairs where the phoneme sequence is longer than the corresponding grapheme sequence.

                         dev               test
                    PER[%]  WER[%]    PER[%]  WER[%]
[Bisani & Ney 08]                     8.3     33.7
external alignment  7.3     32.3      7.8     33.7
resegmentation      7.5     33.8      8.0     34.3
sum over alignments 7.5     33.9      8.1     34.7

that it becomes fully applicable to the Celex corpus and obtains improved results. We are going to restrict ourselves to the Celex corpus, since it is not artificial in terms of double phonemes and thus requires an extension of the M-to-1 alignment model. This alignment handling should lead to better models, such that the baseline results can be reached.

5.5 Initializations

Because the optimization problem is no longer convex when the alignment is introduced as a hidden variable, it depends on the initialization whether the global or just a local optimum is reached during training (see Section 3.8). Therefore, several initializations described in Section 3.9 have been tested. All experiments below were carried out using CRF Model 1 and the summation approach on the Celex corpus, with 150 iterations in training instead of 50; the results are listed in Table 5.8. We have increased the number of iterations by 100 because, as we have observed during the experiments, the training of a model using the summation approach and initialized with IBM Model 1 needs that many iterations to fully converge, i.e. there is an improvement of 0.5% WER on the development corpus for the IBM Model 1 initialization between the 50th and the 150th iteration.

The 0-model is not the worst initialization, since it performs better than the initialization with the maximum alignment model trained on a resegmented corpus, and of all three initializations with relative frequencies, only the initialization of all features yields a small improvement. When initializing with IBM Model 1, one obtains a clear improvement on the development corpus, although on the test corpus only the WER is improved.


Table 5.8. Evaluation of the different initializations for the summation approach on Celex using CRF Model 1. The number of iterations has been increased to 150, since the training of a model using the summation approach and initialized with IBM Model 1 needs that many iterations to converge.

                                              dev               test
                                         PER[%]  WER[%]    PER[%]  WER[%]
[Bisani & Ney 08]                                          2.5     11.4
0-model                                  2.8     13.3      2.6     12.8
max-model (resegmented corpus)           2.8     14.0      2.7     13.6
IBM Model 1                              2.7     12.7      2.6     12.6
relative frequencies (h_k(e_{a_j}, f_j))       2.8  13.5   2.6     12.8
relative frequencies (h_k(e_{a_j}, f_{j±δ}))   2.8  13.5   2.6     12.9
relative frequencies (all features)            2.7  13.3   2.7     13.0

5.6 Doubled Grapheme Sequences

As shown in Section 3.3, the M-to-1 alignment model is not applicable to corpora like Celex, which contain phoneme sequences longer than their corresponding grapheme sequences, due to graphemes generating multiple phonemes. An example is given in Figure 3.6. So far, the solution has been to extend the grapheme sequence by additional equally distributed epsilons up to the length of the phoneme sequence, as described in Subsection 3.3.1. Unfortunately, this is not really applicable in practice, since the length of a phoneme sequence is unknown in the search procedure. Nevertheless, this approach is a first step in extending the CRF model used in the tagging domain such that it becomes applicable to G2P conversion. Even without any extension, the M-to-1 alignment model allows for the evaluation of different CRF models, see Section 5.3. Moreover, it works well for corpora that do not contain graphemes generating multiple phonemes, e.g. NETtalk 15k.

Our following experiments are based on the assumption that 1-to-2 alignment links are sufficient for English corpora, i.e. we assume that no grapheme produces more than two phonemes. This does not hold true for abbreviations, as shown in Figure 5.1. Because such abbreviations are very rare and provide no information about the pronunciation of graphemes, we remove these examples from the training corpus instead of allowing for 1-to-n alignment links with n ≥ 3.

In order to find a solution for the 1-to-2 alignment problem described above, i.e. to make our model applicable to arbitrary English corpora, e.g. Celex, our first approach is to extend all grapheme sequences in the training, development and test corpora. Three different methods for extending the grapheme sequences were investigated and will be illustrated below. Figure 5.2, which revisits the example given in Figure 3.6,
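The equally distributed epsilon extension mentioned above can be sketched as follows; the epsilon marker `<eps>` and the exact distribution rule are assumptions made for this illustration.

```python
# Sketch: extend a grapheme sequence to the length of its phoneme sequence
# by inserting equally distributed epsilon symbols (assumed marker "<eps>"),
# so that an M-to-1 alignment becomes possible even when J < I.

def extend_with_epsilons(graphemes, target_len, eps="<eps>"):
    """Distribute target_len - len(graphemes) epsilons as evenly as
    possible between the graphemes of the sequence."""
    n_eps = target_len - len(graphemes)
    if n_eps <= 0:
        return list(graphemes)
    out = []
    inserted = 0
    for i, g in enumerate(graphemes, 1):
        out.append(g)
        # number of epsilons that should have been inserted after grapheme i
        quota = round(n_eps * i / len(graphemes))
        out.extend([eps] * (quota - inserted))
        inserted = quota
    return out

# "expiry" (J = 6) extended to the phoneme sequence length I = 8:
seq = extend_with_epsilons(list("expiry"), 8)
```

The extended sequence keeps the original graphemes in order and contains exactly target_len symbols.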


“mr”[mIst@R] = mr[mIst@R]

Figure 5.1. Example of an abbreviation with 2 · J < I, where the assumption that no grapheme produces more than two phonemes does not hold, i.e. a doubled grapheme sequence is not sufficient to allow for an alignment. We remove such examples from the training corpus because they contain no information about the pronunciation of graphemes.

depicts a sequence pair including 1-to-2 alignment links, but without any extension of the grapheme sequence.

“expiry”[Iksp2@rI] = e[I] x[ks] p[p] i[2@] r[r] y[I]

Figure 5.2. Example of an alignment including 1-to-2 links for a sequence pair with J < I, i.e. without any grapheme sequence extension.
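The removal of abbreviations with 2 · J < I described above can be sketched as follows. This is a minimal illustration under our own conventions (function and variable names are hypothetical): word-pronunciation pairs are given as lists of grapheme and phoneme symbols, and a pair is kept only if even 1-to-2 links could cover all phonemes.

```python
def filter_training_pairs(pairs):
    """Drop pairs whose phoneme sequence is longer than twice the
    grapheme sequence (2*J < I), e.g. abbreviations like "mr" -> [mIst@R],
    for which no monotonic alignment with at most 1-to-2 links exists."""
    return [(g, p) for g, p in pairs if len(p) <= 2 * len(g)]

pairs = [(list("expiry"), list("Iksp2@rI")),   # J=6, I=8: kept
         (list("mr"),     list("mIst@R"))]     # J=2, I=6: 2*J < I, dropped
kept = filter_training_pairs(pairs)
```

Only the "expiry" pair survives the filter; the abbreviation is discarded from training.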

Firstly, the grapheme sequence can be extended by inserting an epsilon after each grapheme, as illustrated in Figure 5.3. Thus, in case of a 1-to-1 alignment link, the epsilon at position j can be aligned to the phoneme e_{a_{j-1}}, and in case of a 1-to-2 alignment link the grapheme can be aligned to the first phoneme, while the epsilon is aligned to the second one.

“expiry”[Iksp2@rI] = e ε[I] x[k] ε[s] p ε[p] i[2] ε[@] r ε[r] y ε[I]

Figure 5.3. Example of an alignment for a sequence pair with J < I. The grapheme sequence is extended by the insertion of an epsilon after each grapheme. Thus, previous 1-to-1 links can be transformed into 2-to-1 links, and previous 1-to-2 links can be modelled as two 1-to-1 links.

Secondly, named epsilons can be inserted instead of plain epsilons, as presented in Figure 5.4. Finally, the grapheme sequence can be extended by simply doubling all graphemes, as illustrated in Figure 5.5.

In order to determine which of the three extension methods performs best, we have carried out three experiments that differ only in the extension method, i.e. all of them use the Tagging CRF Model and the summation approach on Celex over 50 iterations, initialized with the 0-model. Table 5.9 shows the results


“expiry”[Iksp2@rI] = e εe[I] x[k] εx[s] p εp[p] i[2] εi[@] r εr[r] y εy[I]

Figure 5.4. Example of an alignment for a sequence pair with J < I. The grapheme sequence is extended by the insertion of a named epsilon after each grapheme.

“expiry”[Iksp2@rI] = e e[I] x[k] x[s] p p[p] i[2] i[@] r r[r] y y[I]

Figure 5.5. Example of an alignment for a sequence pair with J < I. The grapheme sequence is extended by doubling each grapheme.
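The three extension methods of Figures 5.3-5.5 can be sketched as follows. This is our own minimal illustration (function names are hypothetical); as in Figure 5.4, a named epsilon is represented here by concatenating ε with the grapheme it follows.

```python
EPS = "ε"

def extend_epsilons(graphemes):
    # Figure 5.3: insert a plain epsilon after each grapheme
    return [s for g in graphemes for s in (g, EPS)]

def extend_named_epsilons(graphemes):
    # Figure 5.4: insert a grapheme-specific ("named") epsilon after each grapheme
    return [s for g in graphemes for s in (g, EPS + g)]

def extend_doubling(graphemes):
    # Figure 5.5: double every grapheme
    return [s for g in graphemes for s in (g, g)]

word = list("expiry")
extend_epsilons(word)        # ['e', 'ε', 'x', 'ε', ...]
extend_named_epsilons(word)  # ['e', 'εe', 'x', 'εx', ...]
extend_doubling(word)        # ['e', 'e', 'x', 'x', ...]
```

All three methods double the source length, which is exactly what makes the fixed feature ranges effectively smaller in the experiments below.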

obtained. We picked the Tagging CRF Model instead of CRF Model 1 because its training is less time consuming, due to the decoupling of alignment steps and target sequence; this introduces additional complexity into the implementation, but still allows for a comparison to the previous approach of inserting equally distributed epsilons.

All three methods yield quite similar results, although the extension by doubled graphemes is slightly better. Nevertheless, all three are still far worse than the previous approach. This is probably because the source sequence is twice as long, which implies, in principle, that the lexical and combined lexical features effectively cover only half the range they had before. Therefore, we have increased both the lexical feature range δlex and the combined lexical feature range δcmb to 6. Table 5.10 shows the results obtained in comparison to those obtained with a range of 5, as well as those obtained when inserting equally distributed epsilons only in cases where J < I and using ranges

Table 5.9. Evaluation of the different grapheme sequence extensions on Celex using the Tagging CRF Model, trained following the summation approach for 50 iterations. In these experiments, the Tagging CRF Model was preferred to CRF Model 1, since its training is less time consuming. Nevertheless, it allows for a comparison to the previous approach of inserting equally distributed epsilons.

                                    dev               test
grapheme sequence extension     PER[%]  WER[%]    PER[%]  WER[%]
equally distributed epsilons     3.1     15.0      2.9     14.2
epsilons                         5.3     25.7      5.3     25.6
named epsilons                   5.2     25.5      5.3     25.7
double graphemes                 5.2     24.7      5.1     25.0


Table 5.10. Evaluation of the insertion of double graphemes on Celex, using different source context sizes δlex and δcmb. As in the experiments presented in Table 5.9, the Tagging CRF Model was trained following the summation approach for 50 iterations.

                                              dev               test
grapheme sequence extension   δlex, δcmb  PER[%]  WER[%]    PER[%]  WER[%]
equally distributed ε's            3       4.3     20.7      4.0     19.9
double graphemes                   5       5.2     24.7      5.1     25.0
double graphemes                   6       5.3     24.9      5.1     24.5

of 3, i.e. exactly half, since the source sequence is not doubled.

Unexpectedly, the results get even worse when the range of the features on the source side is increased. Moreover, they are considerably worse than the results obtained with half the range and without doubling the graphemes, i.e. with practically the same effective range. Since only a comparatively small number of graphemes generate two phonemes, we conclude that, although doubling every grapheme provides a solution for these cases, it introduces too many errors.

5.7 Grapheme Sequence Extension using the Joint n-gram Model

Because the insertion of additional symbols after every grapheme yields no improvement, we conclude that, in order to solve the 1-to-2 alignment problem, the additional symbols should be inserted only at the relevant positions in the grapheme sequence.

Therefore, the first approach is to train the joint n-gram model of [Bisani & Ney 08] and apply it to the training, development and test corpora - as is done in the external alignment approach - and thereby insert named epsilons at the most likely positions. Thus, only the grapheme sequences are changed, such that all source sequences are at least as long as their corresponding target sequences, and named epsilons are inserted after graphemes which the joint n-gram model predicts to generate two phonemes.

We have trained CRF Model 1 on the Celex corpus, preprocessed as described above, for 50 iterations, using the summation over all matching M-to-1 alignments and initialized with the 0-model. Table 5.11 presents its evaluation compared to the results obtained when trained and evaluated on the corpus extended with equally distributed epsilons. On the one hand, we have provided a valid way of applying the M-to-1 alignment model with the summation approach to G2P conversion. On the other hand, this new approach of combining the joint n-gram model in preprocessing with the


Table 5.11. Evaluation of named epsilons inserted using the joint n-gram model on Celex in comparison to equally distributed epsilons. In these experiments, CRF Model 1 was trained for 50 iterations following the summation approach.

                                     dev               test
                                 PER[%]  WER[%]    PER[%]  WER[%]
equally distributed ε's           2.8     13.5      2.8     13.4
named ε's at most likely pos.     2.7     12.8      2.6     12.2
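The preprocessing step of inserting named epsilons at the positions predicted by the external model can be sketched as follows. This is our own illustration under a simplifying assumption: the joint n-gram model's alignment is represented here as, per grapheme, the number of phonemes aligned to it (function name and link representation are hypothetical).

```python
EPS = "ε"

def insert_named_epsilons(graphemes, links):
    """links[j] = number of phonemes the external (joint n-gram) model
    aligns to grapheme j. A named epsilon is inserted after each grapheme
    that generates two phonemes, so the extended source sequence becomes
    at least as long as the phoneme sequence."""
    out = []
    for g, n in zip(graphemes, links):
        out.append(g)
        if n == 2:
            out.append(EPS + g)
    return out

# "expiry" -> [Iksp2@rI]: 'x' -> [ks] and 'i' -> [2@] are 1-to-2 links
insert_named_epsilons(list("expiry"), [1, 2, 1, 2, 1, 1])
# ['e', 'x', 'εx', 'p', 'i', 'εi', 'r', 'y']
```

In contrast to the blanket extensions of Section 5.6, the source sequence grows only where a 1-to-2 link is actually predicted.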

Table 5.12. Evaluation of the summation approach in comparison to the maximum approach using an external alignment computed by the joint n-gram model of [Bisani & Ney 08], on Celex with named epsilons inserted using the joint n-gram model.

                                    dev               test
                                PER[%]  WER[%]    PER[%]  WER[%]
max. approach (joint n-gram)     2.6     12.8      2.6     12.5
summation approach               2.7     12.8      2.6     12.2

summation over all matching alignments in training outperforms the previous results for the summation approach on both the development and test corpora. This improvement is due to the named epsilons being inserted at the most likely positions rather than equally distributed. Although they are inserted using an external model, which implies error propagation, there is a clear improvement.

In Table 5.12, we compare the previous results to the results obtained after 50 iterations when using the joint n-gram model to compute an external alignment, instead of only extending the grapheme sequences and summing over all matching alignments afterwards. Both experiments yield quite similar results: the external alignment approach is slightly better on the development corpus, while the summation approach performs somewhat better on the evaluation corpus.

In order to further improve the results obtained with the combination of the joint n-gram model and the summation over matching alignments, we have additionally initialized CRF Model 1 with IBM Model 1, as shown in Section 5.5. Table 5.13 shows the results after 50 iterations in comparison to the results of the corresponding experiment using the 0-model initialization, as well as the external alignment experiment using the joint n-gram model. The additional initialization with IBM Model 1 clearly outperforms the previous 0-model initialization, as well as the external alignment, on both the development and evaluation corpora.

During our initialization experiments in Section 5.5, we observed some improvement when the number of training iterations is increased by 100. Therefore, we trained the above CRF Model 1, initialized with IBM Model 1, for 150


Table 5.13. Evaluation of the summation approach initialized with IBM Model 1 in comparison to the maximum approach using an external alignment computed by the joint n-gram model of [Bisani & Ney 08], on Celex with named epsilons inserted using the joint n-gram model.

                                     dev               test
                                 PER[%]  WER[%]    PER[%]  WER[%]
max. approach (joint n-gram)      2.6     12.8      2.6     12.5
summation approach, 0-model       2.7     12.8      2.6     12.2
summation approach, IBM Model 1   2.6     12.3      2.5     12.0

Table 5.14. Evaluation of the summation approach initialized with IBM Model 1 in comparison to [Bisani & Ney 08], on Celex with named epsilons inserted using the joint n-gram model.

                                     dev               test
                                 PER[%]  WER[%]    PER[%]  WER[%]
[Bisani & Ney 08]                  -       -        2.5     11.4
summation approach, IBM Model 1   2.5     12.0      2.4     11.8

iterations before evaluating it on the test corpus. The results compared to the baseline are shown in Table 5.14. When trained for 150 iterations, this approach obtains a better PER than the pure joint n-gram model of [Bisani & Ney 08]. The fact that the WER is worse only means that the phoneme errors are more evenly distributed over the sequence pairs, since the absolute number of phoneme errors is lower.

In summary, we have shown that using the joint n-gram model to extend the grapheme sequences with named epsilons at the most likely positions, combined with initializing CRF Model 1 with IBM Model 1, yields a model trained on all matching M-to-1 alignments that outperforms the baseline of [Bisani & Ney 08]. Since the optimization of the summation approach is not convex, the initialization plays a crucial role. Moreover, extending the source sequences at the most likely positions allows us to use the M-to-1 alignment model in an application domain with 1-to-2 alignment links.
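The PER/WER trade-off discussed above can be made concrete with the following sketch of how both error rates are computed. This is our own minimal illustration: PER as the ratio of phoneme-level edit operations [Levenshtein 66] to the total number of reference phonemes, and WER as the fraction of words whose hypothesized pronunciation differs from the single reference (function names are hypothetical; evaluation setups with multiple reference pronunciations need an extension).

```python
def edit_distance(a, b):
    # Levenshtein distance via dynamic programming over a single row;
    # d[j] holds the distance between a-prefix and b-prefix of length j.
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (x != y)) # substitution / match
    return d[-1]

def per_wer(hyps, refs):
    # PER: phoneme edit operations over total reference phonemes.
    # WER: fraction of words with at least one phoneme error.
    errors = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    per = errors / sum(len(r) for r in refs)
    wer = sum(h != r for h, r in zip(hyps, refs)) / len(refs)
    return per, wer
```

A system can thus lower the total number of phoneme edits (PER) while spreading the remaining edits over more words (higher WER), which is exactly the behaviour observed in Table 5.14.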

5.8 1-to-2 Alignment Links

Motivated by the results on the Celex corpus in Section 5.7 - with its grapheme sequences extended by the insertion of named epsilons using the joint n-gram model of [Bisani & Ney 08] - which were obtained using CRF Model 1 initialized with IBM Model 1 and following the summation approach, we have developed our CRF software further,


Table 5.15. Evaluation of various initializations for CRF Model 1 with 1-to-2 alignment links following the summation approach on Celex. The number of iterations has been increased to 125, since the additional inclusion of 1-to-2 alignment links requires that many iterations to achieve convergence.

                                           dev               test
                                       PER[%]  WER[%]    PER[%]  WER[%]
[Jiampojamarn & Cherry+ 10]              -       -         -      10.8
[Bisani & Ney 08]                        -       -        2.5     11.4
0-model                                 3.0     14.3      3.0     14.2
IBM Model 1                             2.6     12.2      2.4     11.9
relative frequencies (h_k(e_aj, f_j))   3.1     14.3      3.1     14.3
relative frequencies (h_k(e_aj, f_j±δ)) 3.1     14.5      3.0     14.2
relative frequencies (all features)     3.2     14.7      3.0     14.2

such that a grapheme can be aligned to two phonemes, as well as to one phoneme or one named epsilon (see Section 3.4). Firstly, this enables the application of the CRF software following the summation approach independently of any further external model except for IBM Model 1, i.e. the preprocessing step of inserting named epsilons using the joint n-gram model of [Bisani & Ney 08] is eliminated. Note that the initialization with IBM Model 1 can be easily integrated into the CRF software. Secondly, we eliminate the error propagation implied by the insertion of named epsilons using a statistical model, since the CRF model now includes 1-to-2 alignment links itself. Hence, the CRF software with integrated 1-to-0, 1-to-1 and 1-to-2 alignment links becomes fully applicable to the G2P corpora Celex and NETtalk 15k.

We have trained CRF Model 1 following the summation approach and including 1-to-2 alignment links with various initializations for 125 iterations. We performed 75 additional iterations because the results on the development corpus showed that the additional inclusion of 1-to-2 alignment links requires that many iterations to achieve convergence. The results are presented in Table 5.15.

On the one hand, the 0-model and relative frequencies initializations yield quite comparable results on both the development and evaluation corpora. Considering the PER on the development corpus, the initialization with the 0-model performs slightly better than the relative frequencies initializations. On the evaluation corpus, the 0-model initialization performs as well as the relative frequencies initializations of the lexical features.

On the other hand, the IBM Model 1 initialization outperforms the relative frequencies and 0-model initializations on both the development and evaluation corpora. The difference between the 0-model and the IBM Model 1 initialization is 0.4% PER and 2.1% WER on the development corpus, and 0.6% PER and 2.3% WER on the evaluation corpus.
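The size of the alignment search space that the summation approach marginalizes over can be sketched with a simple dynamic program. This is our own illustration (function name is hypothetical): it counts the monotonic alignments in which each grapheme carries exactly one 1-to-0, 1-to-1 or 1-to-2 link and the whole phoneme sequence is covered, mirroring the recombination the FSA representation of Section 3.4 performs.

```python
def count_alignments(J, I):
    """Number of monotonic alignments of J graphemes to I phonemes,
    where each grapheme generates 0, 1 or 2 phonemes (1-to-0, 1-to-1,
    1-to-2 links) and all I phonemes are covered."""
    # A[j][i]: alignments of the first j graphemes to the first i phonemes
    A = [[0] * (I + 1) for _ in range(J + 1)]
    A[0][0] = 1
    for j in range(1, J + 1):
        for i in range(I + 1):
            A[j][i] = A[j - 1][i]            # 1-to-0 link (named epsilon)
            if i >= 1:
                A[j][i] += A[j - 1][i - 1]   # 1-to-1 link
            if i >= 2:
                A[j][i] += A[j - 1][i - 2]   # 1-to-2 link
    return A[J][I]

count_alignments(6, 8)  # "expiry" -> [Iksp2@rI]
```

The same forward recursion, with counts replaced by feature-weighted scores, underlies the summation over all matching alignments in training.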


In addition, the initialization with IBM Model 1 outperforms the joint n-gram model of [Bisani & Ney 08]. Since the PER of the IBM Model 1 initialization is lower, the higher WER indicates only that the phoneme errors are more evenly distributed, i.e. that there are more words containing at least one error, but fewer errors in total. Unfortunately, the PER on the evaluation corpus obtained by [Jiampojamarn & Cherry+ 10] is unknown to us, so we cannot compare the absolute numbers of errors made. As presented in Table 5.15, the WER obtained by [Jiampojamarn & Cherry+ 10] is better than ours by 1.1%, i.e. our system generates more pronunciations containing at least one phoneme error, although we do not know whether we produce a higher or lower total number of phoneme errors.

As a result, CRF Model 1 including 1-to-2 alignment links, summing over all matching alignments and initialized with IBM Model 1 is a stand-alone G2P conversion system, independent of any precomputed alignments, which outperforms the joint n-gram model of [Bisani & Ney 08].


Chapter 6

Conclusion and Outlook

In this thesis, we implemented a stand-alone CRF model with integrated alignments for G2P conversion by further developing the implementation of CRFs previously used for tagging.

In order to eliminate the preprocessing step of computing a single best alignment, we introduced the summation over all matching alignments into the training of the CRF model. Hence, we obtained independence from an external model, eliminated the preprocessing step and avoided any error propagation. Because we introduced the alignments as latent variables into the CRF model, the optimization problem is no longer convex. Therefore, we investigated various initializations, of which the initialization with IBM Model 1 performed best. Although the initialization with IBM Model 1 needs a preprocessing step, it can be easily integrated into the CRF implementation, thus reducing manual work. Moreover, we introduced 1-to-2 alignment links into our CRF model to make it fully applicable to the corresponding G2P corpora. If necessary, the CRF software is extendable with 1-to-n alignment links with n ≥ 3. Finally, we decoupled the target sequences and the alignments, which allowed for the implementation of new reasonable feature functions, resulting in CRF Models 1 to 5.

Using the CRF software, which was initially designed for concept tagging, as a starting point, we developed the implementation further and fully integrated the alignments, i.e. gained independence from any external alignment model, implemented new feature functions that further improve the performance, investigated various initializations to provide a practical solution to the convexity problem, and added 1-to-2 alignment links to the 1-to-1 alignment model of the CRF software.
As a result, we obtained a stand-alone G2P conversion system, applicable to the Celex and NETtalk 15k corpora and easily extendable with 1-to-n alignment links, so as to be applicable to other corpora in various languages as well.

In Chapter 5, we showed that CRF Model 1 - following the summation approach and initialized with the 0-model - outperforms the state-of-the-art joint n-gram model of [Bisani & Ney 08] on NETtalk 15k. Note that the inclusion of 1-to-2 alignment links was not necessary, since NETtalk 15k includes double phonemes and therefore does not have graphemes generating more than one phoneme.


In addition, initializing CRF Model 1 with IBM Model 1, including 1-to-2 alignment links and summing over all matching alignments in the training of the model parameters outperforms the joint n-gram model of [Bisani & Ney 08] on Celex. Moreover, it yields results comparable to the state-of-the-art online discriminative training of [Jiampojamarn & Cherry+ 10].

In order to improve the system, more sophisticated initializations could be investigated, possibly leading to even better results. Furthermore, summing only over the matching alignments whose score exceeds a certain threshold could be tested in the inference of the most probable pronunciation of a given grapheme sequence.

The 1-to-2 alignment links could be improved in the following way: so far, a grapheme f_j can be aligned to two phonemes e_i and e_{i+1}, and its successor f_{j+1} can also be aligned to e_{i+1}. We propose that either n ≥ 1 graphemes should be aligned to one phoneme, or one grapheme should be aligned to two phonemes to which no other grapheme is aligned, since a single grapheme then generates both phonemes without the contribution of any other grapheme. Implementing this additional constraint would decrease the number of possible alignments and presumably improve the performance, since the current CRF models seem to prefer the generation of many unreasonable 1-to-2 alignment links.

In addition, it would be interesting to see how the system performs on other corpora, in particular corpora that require 1-to-n links with n ≥ 3, and to compare its performance to various state-of-the-art systems.

Due to the more sophisticated features (i.e. the decoupling of target sequences and alignments), the initializations using IBM Model 1 or relative frequencies, and the 1-to-2 alignment links, the system has become considerably slower in comparison to its previous application in concept tagging.
Thus, we suggest further improving its efficiency. A higher efficiency, combined with trigrams or even n-grams, could allow for the application of CRFs with integrated alignments in the domain of monotonic translation. Note that an improvement in efficiency is necessary to allow for applications within domains that imply larger input and output vocabularies.


List of Figures

1.1 Examples of grapheme-phoneme sequence pairs and possible alignments . . . 2
1.2 Example of different possible alignments for a grapheme-phoneme sequence pair using the joint-sequence model . . . 3
1.3 Example of two disagreeing n-best list entries in the alignment by aggregation approach in [Jiampojamarn & Kondrak 10] . . . 5
1.4 Example of a resulting M-to-N alignment created by the alignment by aggregation approach in [Jiampojamarn & Kondrak 10] . . . 6

2.1 Example of unaligned grapheme-phoneme pairs . . . 10
2.2 Example of an unaligned grapheme-phoneme pair . . . 10

3.1 Example of unaligned grapheme-phoneme pairs . . . 24
3.2 Possible alignments for the grapheme-phoneme pairs presented in Figure 3.1 . . . 24
3.3 Example of a possibly correctly inferred grapheme-phoneme pair including an alignment . . . 24
3.4 General structure of a monotonic M-to-1 alignment with 1-to-0 and 1-to-1 alignment links . . . 27
3.5 Example of an extended phoneme sequence . . . 28
3.6 Example of an alignment including 1-to-2 links for a sequence pair with J < I . . . 28
3.7 Example of an alignment for sequence pair with J < I with uniformly distributed ε's . . . 29
3.8 Example of an alignment for a sequence pair with J < I computed with the joint n-gram model of [Bisani & Ney 08] . . . 30
3.9 Example of the 1-to-1 alignment for sequence pair with J = I . . . 30
3.10 Example of 1-to-1 alignments for a sequence pair with J < I . . . 30
3.11 Possible M-to-1 alignments for a sequence pair with J > I . . . 31
3.12 Representation of the sequence pair and all matching M-to-1 alignments presented in Figure 3.11 as an FSA in the CRF implementation . . . 31
3.13 Reformulation of the alignment problem for the example presented in Figure 3.11 . . . 32
3.14 Reformulation of the alignment problem annotated with the number of possible paths for the example presented in Figure 3.11 . . . 33
3.15 General structure of a monotonic M-to-1 alignment with 1-to-0, 1-to-1 and 1-to-2 alignment links . . . 34
3.16 Example of a possible alignment including 1-to-2 links for a grapheme-phoneme sequence pair with J < I . . . 34
3.17 Representation of the sequence pair and all matching M-to-1 alignments presented in Figure 3.17 as an FSA in the CRF implementation . . . 35
3.18 CRF training using an external alignment . . . 36
3.19 Example of a linear segmentation . . . 37
3.20 CRF training using resegmentation . . . 38
3.21 CRF training using an external alignment . . . 39

5.1 Example of an abbreviation with 2 · J < I . . . 61
5.2 Example of an alignment including 1-to-2 links for a sequence pair with J < I, i.e. without any grapheme sequence extension . . . 61
5.3 Example of an alignment for a sequence pair with J < I, extended by the insertion of epsilons . . . 61
5.4 Example of an alignment for a sequence pair with J < I, extended by the insertion of named epsilons . . . 62
5.5 Example of an alignment for a sequence pair with J < I, extended by the doubling of each grapheme . . . 62


List of Tables

4.1 Overview of the distinct CRF models and their features . . . 51

5.1 G2P corpora statistics . . . 54
5.2 Evaluation of the CRF models on Celex following the maximum approach with an external alignment computed using the joint n-gram model of [Bisani & Ney 08] . . . 55
5.3 Evaluation of CRF models on Celex following the summation approach . . . 56
5.4 Evaluation of the CRF models on NETtalk 15k following the maximum approach with an external alignment computed using the joint n-gram model of [Bisani & Ney 08] . . . 56
5.5 Evaluation of CRF models on NETtalk 15k following the summation approach . . . 57
5.6 Evaluation of the different alignment approaches on Celex . . . 58
5.7 Evaluation of the different alignment approaches on NETtalk 15k . . . 59
5.8 Evaluation of the different initializations for the summation approach on Celex using CRF Model 1 . . . 60
5.9 Evaluation of the different grapheme sequence extensions on Celex . . . 62
5.10 Evaluation of the insertion of double graphemes on Celex, using different source context sizes . . . 63
5.11 Evaluation of named epsilons inserted using the joint n-gram model on Celex in comparison to equally distributed epsilons . . . 64
5.12 Evaluation of the summation approach in comparison to the maximum approach on Celex with named epsilons inserted using the joint n-gram model of [Bisani & Ney 08] . . . 64
5.13 Evaluation of the summation approach initialized with IBM Model 1 in comparison to the maximum approach on Celex with named epsilons inserted using the joint n-gram model of [Bisani & Ney 08] . . . 65
5.14 Evaluation of the summation approach initialized with IBM Model 1 in comparison to [Bisani & Ney 08] on Celex with named epsilons inserted using the joint n-gram model . . . 65
5.15 Evaluation of various initializations for CRF Model 1 with 1-to-2 alignment links following the summation approach on Celex . . . 66


Bibliography

[Bisani & Ney 08] M. Bisani, H. Ney: Joint-Sequence Models for Grapheme-to-Phoneme Conversion. Speech Communication, Vol. 50, No. 5, pp. 434–451, May 2008.

[Crammer & Singer 03] K. Crammer, Y. Singer: Ultraconservative online algorithms for multiclass problems. J. Mach. Learn. Res., Vol. 3, pp. 951–991, March 2003.

[Engelbrecht & Schultz 05] H. Engelbrecht, T. Schultz: Rapid Development of an Afrikaans-English Speech-to-Speech Translator. In International Workshop on Spoken Language Translation, pp. 169–176, Pittsburgh, PA, USA, 2005.

[Hahn & Lehnen+ 08] S. Hahn, P. Lehnen, C. Raymond, H. Ney: A Comparison of Various Methods for Concept Tagging for Spoken Language Understanding. In International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008.

[Hahn & Lehnen+ 09] S. Hahn, P. Lehnen, G. Heigold, H. Ney: Optimizing CRFs for SLU Tasks in Various Languages Using Modified Training Criteria. In Interspeech, pp. 2727–2730, Brighton, U.K., Sept. 2009.

[Heigold & Wiesler+ 10] G. Heigold, S. Wiesler, M. Nussbaum, P. Lehnen, R. Schlüter, H. Ney: Discriminative HMMs, Log-Linear Models, and CRFs: What is the Difference? In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 5546–5549, Dallas, Texas, USA, March 2010.

[Jiampojamarn & Cherry+ 08] S. Jiampojamarn, C. Cherry, G. Kondrak: Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT), pp. 905–913, Columbus, OH, USA, June 2008.

[Jiampojamarn & Cherry+ 10] S. Jiampojamarn, C. Cherry, G. Kondrak: Integrating Joint n-gram Features into a Discriminative Training Framework. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 697–700, Los Angeles, California, June 2010. Association for Computational Linguistics.


[Jiampojamarn & Kondrak+ 07] S. Jiampojamarn, G. Kondrak, T. Sherif: Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2007), pp. 372–379, Rochester, NY, USA, April 2007.

[Jiampojamarn & Kondrak 09] S. Jiampojamarn, G. Kondrak: Online Discriminative Training for Grapheme-to-Phoneme Conversion. In Proceedings of ISCA Interspeech, pp. 1303–1306, Brighton, U.K., Sept. 2009.

[Jiampojamarn & Kondrak 10] S. Jiampojamarn, G. Kondrak: Letter-Phoneme Alignment: An Exploration. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 780–788, Uppsala, Sweden, July 2010. Association for Computational Linguistics.

[Lafferty & McCallum+ 01] J.D. Lafferty, A. McCallum, F.C.N. Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[Levenshtein 66] V. Levenshtein: Binary Codes Capable of Correcting Deletions, Insertions and Reversals. In Soviet Physics Doklady, Vol. 10, 1966.

[Pearl 88] J. Pearl: Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

[Quattoni & Wang+ 07] A. Quattoni, S. Wang, L.P. Morency, M. Collins, T. Darrell: Hidden Conditional Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, pp. 1848–1852, 2007.

[Riedmiller & Braun 93] M. Riedmiller, H. Braun: A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. In IEEE International Conference on Neural Networks, pp. 586–591, 1993.

[Schroeter & Conkie+ 02] J. Schroeter, A. Conkie, A. Syrdal, M. Beutnagel, M. Jilka, V. Strom, Y.J. Kim, H.G. Kang, D. Kapilow: A perspective on the next challenges for TTS. In IEEE 2002 Workshop in Speech Synthesis, pp. 11–13, Santa Monica, CA, 2002.

[Toutanova & Moore 02] K. Toutanova, R.C. Moore: Pronunciation modeling for improved spelling correction. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pp. 144–151, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.


[Yu & Lam 08] X. Yu, W. Lam: Hidden Dynamic Probabilistic Models for Labeling Sequence Data. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pp. 739–745, Chicago, IL, USA, July 2008.

[Zens & Ney 04] R. Zens, H. Ney: Improvements in Phrase-Based Statistical Machine Translation. In Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics Annual Meeting, pp. 257–264, Boston, MA, May 2004.
