
Author's Accepted Manuscript

Unsupervised language model adaptation for handwritten Chinese text recognition

Qiu-Feng Wang, Fei Yin, Cheng-Lin Liu

PII: S0031-3203(13)00387-7
DOI: http://dx.doi.org/10.1016/j.patcog.2013.09.015
Reference: PR4919

To appear in: Pattern Recognition

Received date: 13 December 2012
Revised date: 17 September 2013
Accepted date: 19 September 2013

Cite this article as: Qiu-Feng Wang, Fei Yin, Cheng-Lin Liu, Unsupervised language model adaptation for handwritten Chinese text recognition, Pattern Recognition, http://dx.doi.org/10.1016/j.patcog.2013.09.015







Unsupervised Language Model Adaptation for Handwritten Chinese Text Recognition

Qiu-Feng Wang, Fei Yin, Cheng-Lin Liu

National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences, No. 95 Zhongguancun East Road, Beijing 100190, P.R. China

Email: {wangqf, fyin, liucl}@nlpr.ia.ac.cn

Abstract

This paper presents an effective approach for unsupervised language model adaptation (LMA) using multiple models in offline recognition of unconstrained handwritten Chinese texts. The domain of the document to recognize is variable and usually unknown a priori, so we use a two-pass recognition strategy with a pre-defined multi-domain language model set. We propose three methods to dynamically generate an adaptive language model to match the text output by first-pass recognition: model selection, model combination and model reconstruction. In model selection, we use the language model with minimum perplexity on the first-pass recognized text. By model combination, we learn the combination weights via minimizing the sum of squared error with both L2-norm and L1-norm regularization. For model reconstruction, we use a group of orthogonal bases to reconstruct a language model with the coefficients learned to match the document to recognize. Moreover, we reduce the storage size of multiple language models using two compression methods of split vector quantization (SVQ) and principal component analysis (PCA). Comprehensive experiments on two public Chinese handwriting databases CASIA-HWDB and HIT-MW show that the proposed unsupervised LMA approach improves the recognition performance impressively, particularly for ancient domain documents with the recognition accuracy improved by 7 percent. Meanwhile, the combination of the two compression methods largely reduces the storage size of language models with little loss of recognition accuracy.

Keywords: Character string recognition, Chinese handwriting recognition, unsupervised language model adaptation, language model compression.

Preprint submitted to Pattern Recognition September 26, 2013

1. Introduction

Handwritten Chinese character recognition has attracted a lot of attention since the 1970s and has

achieved tremendous advances [1, 2]. However, the recognition of unconstrained handwritten Chinese

texts has been reported only in recent years, and the reported accuracies are quite low (e.g., character-level

correct rate of 39 percent in [3]). Besides the divergence of writing styles, handwritten text recognition

is difficult due to the weak lexical constraint: the number of sentence classes is infinite. Although our

recent work by integrating multiple contexts including linguistic context achieved a high correct rate of 91

percent [4], there are still many recognition errors due to the insufficient modeling of linguistic context, par-

ticularly, the language model often mismatches the domain of handwritten text. To deal with this mismatch

problem, we investigate language model adaptation for handwritten Chinese text recognition (HCTR) in

this paper, particularly, unsupervised adaptation for the scenario that there is no prior information about the

domain of text.

Language model (LM) provides a principled way to quantify the uncertainties associated with the nat-

ural language, but this uncertainty is variable in the texts of different domains. The modeling of diverse

domains was originally studied in speech recognition [5], leading to the research of language model adap-

tation. The researchers often combine a generic LM with a domain-specific LM that is more relevant to the

recognition task. However, it is usually difficult to get enough domain-specific data to learn the domain-

specific model [5], and a growing interest is evident in collecting texts from the Internet to supplement

sparse domain-specific resources [6]. For Chinese resources, the Sogou Labs (see footnote 1) provide a large set of resources relevant to diverse domains extracted from the Internet, which can be used to train a multi-domain

LM set via the SRILM toolkit [7]. This LM set can be adaptively applied in HCTR according to the domain

of each document.

Language model adaptation (LMA) is difficult when the domain of recognition task is unknown a priori,

which can only be solved by unsupervised approaches. Many efforts have been made in this direction in

speech recognition. One method is to calculate the probability of one word given a document (herein, the

history of recognized text) as a uni-gram model by latent topic analysis [5, 8, 9, 10], and then interpolate it

with the generic LM. Another popular method is to use the recognized text directly to estimate an adaptive

[Footnote 1: http://www.sogou.com/labs/resources.html]


n-gram model [11, 12], which is interpolated with the generic LM for the next-pass recognition. This

method is usually based on a multi-pass (e.g., two-pass) recognition framework. In handwriting recognition,

unsupervised LMA has been reported only in recent years, and the adaptive LM is very simple. For example,

Xiu and Baird [13] applied the multi-pass recognition strategy iteratively to adapt a word lexicon, and

Lee and Smith [14] iteratively modified uni-gram probabilities further for English whole-book recognition,

where the texts are very long.

Another difficulty of LMA is that the transcripts of handwritten texts are usually short. There are

usually only a few hundred characters in a handwritten text (e.g., Fig. 1), impeding the direct adaptation

of the lexicon or n-gram probabilities. In such a situation, the interpolation of models from a pre-defined

multi-domain LM set is usually used in speech recognition, and the interpolation weights are learned by

the maximum a posteriori (MAP) estimation from the held-out data similar to the task domain (supervised

LMA, e.g., [5, 15]), or the recognized texts (unsupervised LMA, e.g., [16]) or both (e.g., [17]). This solution

also brings a problem of the large storage size of the LM set. To overcome this problem, the size of each

LM is usually reduced by entropy-based pruning [18] and quantization-based compression [19].

In this paper, we propose an unsupervised LMA framework for HCTR. Because no prior information about the document to recognize (test document) is available, we use a two-pass recognition strategy. Because handwritten texts are varied and short, we dynamically generate an adaptive LM to match the

first-pass recognized text of each document using a pre-defined multi-domain LM set. We propose three

methods to generate adaptive LMs, namely model selection, model combination and model reconstruction.

The model selection method is to select the best LM according to the minimum perplexity criterion. In

model combination, we estimate the combination weights by minimizing the sum of squared error (MSE)

with both L2-norm and L1-norm regularization. By model reconstruction, the adaptive LM is constructed

by a group of orthogonal bases. To make the adaptation approach practical, we also consider the reduction of

computational cost and storage space. To speed up the two-pass recognition, we store the candidate charac-

ter classes of each candidate character pattern in the first-pass recognition to avoid repeated classification in

the second-pass. To reduce the storage size of LM set, we compress the LMs using split vector quantization

(SVQ) and principal component analysis (PCA). Finally, we evaluated the recognition performance on two


Figure 1: Two examples of handwritten Chinese text. (a) Modern domain text; (b) ancient domain text.

public Chinese handwriting databases CASIA-HWDB [20] and HIT-MW [21], and showed large improvement of performance gained by the proposed unsupervised LMA methods with comparable computational

cost to the baseline system.

Unlike previous works on unsupervised LMA in speech recognition (e.g., [17]), we regard the language

model combination as a linear regression problem. We learn combination weights via minimizing an error

cost function (i.e., MSE) and further consider model sparsity by adding L1-norm regularization, which is

totally different from the MAP or MBR (minimum Bayes’ risk) based framework [17]. The MAP estimation

is based on the perplexity and MBR takes into account the acoustic model under supervised framework,

and both get a local optimal solution by Baum-Welch algorithm, while the MSE-based method aims at

error loss minimization on the recognized text with a global optimal solution. Another contribution of this

paper is the new LMA method based on model reconstruction by applying PCA to a pre-defined LM set.

This idea is motivated by a technique developed by Sirovich and Kirby [22] for efficiently representing


images of human faces using PCA, and it has also been successfully used in image processing like active

shape models [23]. More importantly, the focus of this work is to investigate the role of LMA in Chinese

handwriting recognition, which, to the best of our knowledge, has not been investigated in depth in the handwriting

recognition field. A preliminary conference version of the paper was presented in [24], and this extended

version provides more detailed descriptions, presents additional LMA methods and a significantly extended

experimental validation. By customizing the handwriting recognition algorithm, the proposed approach can

also apply to the recognition of the documents of other languages (such as English and Arabic).

The rest of this paper is organized as follows: Section 2 reviews some related works; Section 3 gives an

overview of our HCTR system with unsupervised LMA; Section 4 briefly describes the statistical language

models used in this paper; Section 5 describes in detail the unsupervised LMA methods of model selection,

model combination and model reconstruction; Section 6 introduces the LM compression methods; Section 7

presents the experimental results, and Section 8 draws concluding remarks.

2. Related Works

The large variability of domains across different handwritten texts makes accurate language modeling

a challenge. Language model adaptation (LMA) is a process that adapts the language model to match

the domain of each recognition task. However, LMA has been rarely studied in handwriting recognition.

Recently, Xiu and Baird [13] adapted a word lexicon from the previous recognized text for English whole-

book recognition, and Lee and Smith [14] further modified word uni-gram probabilities in caches. In

Chinese handwriting recognition, we reported our first attempt at unsupervised LMA using a simple model

combination method [24]; a similar work was then presented in [25], where the estimation of combination

weights is very sensitive to the string length.

Although few works of LMA in handwriting recognition have been reported, many studies have been

conducted for LMA in speech recognition [5, 17] and natural language processing (NLP) [26, 27], which

can be categorized into supervised and unsupervised LMA. Supervised LMA assumes that the topic in-

formation of recognition task is known in advance, and meanwhile, a set of domain-specific held-out data


is available to either learn the interpolation weights of several pre-defined LMs (e.g., [5, 15]) or train a

domain-specific LM to interpolate with the generic LM (e.g., [26, 28]). In contrast, various unsupervised

methods utilize recognized text to get an adaptive LM, which can be grouped into three categories.

• Latent topic analysis. Such method views the recognized text as a document and calculates the

probability of a word given the document using latent semantic analysis (LSA) [5, 8] or its prob-

abilistic extensions: probabilistic LSA (PLSA) and latent Dirichlet allocation (LDA) [9, 10], then

this probability is seen as an adaptive uni-gram model to interpolate with the generic LM for further

recognition.

• Estimating an n-gram model. If the recognized text is long enough, it can be directly used to train

an adaptive n-gram model [11, 12]; otherwise, it can be used as queries to select relevant data from a

large general corpus via information retrieval (IR) techniques, to train an adaptive n-gram model [29].

Finally, this adaptive n-gram model is also interpolated with the generic LM for further recognition.

• Mixture models. Instead of training a domain-specific LM to interpolate with the generic LM, the

recognized text is used to learn the interpolation weights of mixture models, where each component

LM is related to one pre-defined domain [16, 17]. The interpolation weights are typically estimated

by the maximum a posteriori (MAP) method [17]. The Bayesian method has been used in the mobile

speech recognition system when the prior probability of each domain is known [16].

Both supervised and unsupervised LMA improve the recognition performance, and Wang and Stolcke [30]

combined the two methods to produce an additional gain. However, unsupervised LMA is more relevant

to real applications where the topic is unknown a priori.

In addition, LMA is closely related to transfer learning [31] in the field of machine learning, which tries

to give a solution to the mismatch between training data and test data in either the feature representation or

the samples distribution. The distribution mismatch is the concern of LMA methods, because the statistical

LM is actually a probability distribution on character or word sequences [5].

With text corpora increasingly available, the statistical LM directly trained from the corpora can contain

billions of n-grams. Many efforts have been made to reduce the storage of such models. On one hand,


several methods focus on how to remove the useless n-grams [18, 32]. On the other hand, more compact

representations of n-gram models without removing any n-grams are studied [19, 33, 34], such as trie

structure of n-gram table and data quantization of n-gram values (i.e., probabilities and back-off weights).

While these methods encode each n-gram separately, the quantization of groups of n-grams can further

compress the LM efficiently. The split vector quantization (SVQ) technique splits high-dimensional vectors

into low-dimensional ones and compresses them by vector quantization. It was originally developed in

speech recognition [35], and has been successfully used to compress a Gaussian classifier in handwriting

recognition [36]. This technique, however, has not been evaluated in compressing statistical language

models.

3. System Overview

The baseline handwriting recognition system without LMA is introduced in our previous paper [4],

which is based on the integrated segmentation-and-recognition framework with character over-segmentation.

Because no prior information about each handwritten text is available, we use a two-pass recognition strategy

for LMA. In the first-pass recognition, a generic LM is used to get a preliminary recognized text, then the

text is used to get an adaptive LM for the second-pass recognition.

Figure 2 illustrates the block diagram of the complete system with LMA, where seven steps are included

in the first-pass recognition:

(1) The document image is segmented into text lines;

(2) Each line image is over-segmented into a sequence of primitive segments (Fig. 3(a)) such that each

segment is a character or a part of a character;

(3) One or more consecutive segments are concatenated to generate candidate character patterns (Fig. 3(b));

(4) Each candidate pattern is classified to assign several candidate character classes, forming a character

candidate lattice (Fig. 3(c));

(5) Each sequence of candidate character classes is matched with a lexicon to segment into candidate


Figure 2: System diagram for handwritten Chinese text recognition with LMA.

words (see footnote 2), forming a word candidate lattice (Fig. 3(d));

(6) Each character or word sequence C paired with candidate pattern sequence Xs (this pair is called a

candidate segmentation-recognition path) is evaluated by fusing multiple contexts, and the optimal

path is searched to give the segmentation and recognition result;

(7) The recognized results of all text lines are concatenated to give the document transcript (recognized

text), which is used for LMA in the second-pass recognition or output as the final result.

In the above, only the last three steps are needed in the second-pass recognition, because the character

candidate lattice (all candidate character classes) of each text line is independent of the LM, and can be

stored (using little memory) to avoid repeated character classification after the first-pass recognition. Note that the classification of all candidate character patterns dominates the whole recognition time (see Table 4); by avoiding its repetition, the two-pass recognition system incurs little additional cost compared to the baseline system.

In this work, we evaluate each segmentation-recognition path using the function of weighting with

[Footnote 2: In Chinese, a word may comprise one or multiple characters, and can capture both syntactic and semantic meaning better than a character.]


Figure 3: (a) Over-segmentation; (b) segmentation candidate lattice; (c) character candidate lattice of a segmentation (thick path) in (b); (d) word candidate lattice of (c).

character pattern width (WCW), which integrates the character classification score, geometric context and

linguistic context [4]:

$$f(X^s, C) = \sum_{i=1}^{m} \Big( w_i \cdot lp_i^0 + \sum_{j=1}^{4} \lambda_j \cdot lp_i^j \Big) + \lambda_5 \cdot \log P(C), \tag{1}$$

where m is the number of characters in the path $X^s$, and $w_i$ is the width of the i-th character pattern after normalizing by the estimated height of the text line. $lp_i^0 = \log p(c_i|x_i)$ is the character classification score for candidate character class $c_i$ of the i-th character pattern, which is calculated by the confidence transformation of the character classifier outputs. $lp_i^1 = \log p(c_i|g_i^{uc})$, $lp_i^2 = \log p(c_{i-1}c_i|g_i^{bc})$, $lp_i^3 = \log p(z_i^p = 1|g_i^{ui})$, and $lp_i^4 = \log p(z_i^g = 1|g_i^{bi})$ are four geometric model scores of the i-th character pattern, where $g_i^{uc}$, $g_i^{bc}$, $g_i^{ui}$ and $g_i^{bi}$ represent geometric features that are unary class-dependent (e.g., character position), binary class-dependent (e.g., distance between two consecutive characters), unary class-independent (e.g., size of bounding box) and binary class-independent (e.g., gap between two consecutive bounding boxes), respectively. $z_i^p = 1$ and $z_i^g = 1$ mean a valid character pattern and a valid between-character gap, respectively. $\log P(C)$ denotes the linguistic context score of this path. The combining weights $\lambda_j$, $j = 1, \ldots, 5$, are optimized by

Maximum Character Accuracy (MCA) training. Under this path evaluation function, we use a refined beam

search method to efficiently find the optimal path. All the details of these techniques can be found in our

previous paper [4], and we focus on how to get the adaptive language model in this paper, i.e., the part

surrounded by dotted-line in Fig. 2.
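The path evaluation function of Eq. (1) can be illustrated with a minimal Python sketch. The `CharScores` container and the `path_score` function are hypothetical names introduced here for illustration only; the sketch assumes that the classifier score, the four geometric scores and the linguistic score $\log P(C)$ have already been computed elsewhere.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CharScores:
    """Scores of one candidate character pattern (hypothetical container)."""
    width: float           # w_i, width normalized by the estimated line height
    lp0: float             # log p(c_i | x_i), character classification score
    geo: Sequence[float]   # (lp1, lp2, lp3, lp4), the four geometric log-scores


def path_score(chars, lambdas, log_p_c):
    """Evaluate Eq. (1) for one segmentation-recognition path.

    lambdas holds (lambda_1, ..., lambda_5); log_p_c is log P(C).
    """
    total = 0.0
    for ch in chars:
        total += ch.width * ch.lp0                                  # w_i * lp_i^0
        total += sum(l * g for l, g in zip(lambdas[:4], ch.geo))    # geometric terms
    return total + lambdas[4] * log_p_c                             # linguistic term
```

Since only the last term depends on the language model, swapping the LM between passes changes $\log P(C)$ while all other scores can be reused.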


4. Statistical Language Model

In the path evaluation function Eq. (1), the linguistic context score log P(C) plays a very important role,

which is usually given by a statistical language model. The most popular language model is the n-gram

model [37], where n is called the order of the model. Such model characterizes the statistical dependency

between n characters or words. In consideration of model complexity, the order n usually takes 2 or 3,

meaning the bi-gram and tri-gram model, respectively. In this paper, we evaluate five types of n-gram

models, which have been successfully used in our baseline system [4], namely, character bi-gram (cbi),

character tri-gram (cti), word bi-gram (wbi), word class bi-gram (wcb), and interpolating word and class

bi-gram (iwc). All these models are summarized in Table 1, where $C = \langle c_1 \ldots c_m \rangle$ is a character sequence and m is the number of characters in C. At the word level, C is segmented into a word sequence $C = \langle w_1 \ldots w_l \rangle$, where l is the number of words and $W_i$ is the word class of the word $w_i$. The weight $\lambda_6$ is learned together with the weights in Eq. (1) by MCA training [4].

Table 1: The five types of n-gram models used in our system.

level      n-gram  formula
character  cbi     $P_{cbi}(C) = \prod_{i=1}^{m} p(c_i|c_{i-1})$
character  cti     $P_{cti}(C) = \prod_{i=1}^{m} p(c_i|c_{i-2}c_{i-1})$
word       wbi     $P_{wbi}(C) = \prod_{i=1}^{l} p(w_i|w_{i-1})$
word       wcb     $P_{wcb}(C) = \prod_{i=1}^{l} p(w_i|W_i)\,p(W_i|W_{i-1})$
word       iwc     $\log P_{iwc}(C) = \log P_{wbi}(C) + \lambda_6 \cdot \log P_{wcb}(C)$
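As a rough illustration of the formulas in Table 1, the following Python sketch accumulates a bi-gram log-probability over a sequence and forms the interpolated iwc score. The function names, the `prob` callback and the start symbol `"<s>"` used as the history of the first unit are assumptions made for this sketch; the text does not specify how the first position is handled.

```python
import math


def bigram_logprob(seq, prob, start="<s>"):
    """log P(C) = sum_i log p(u_i | u_{i-1}) for units u (characters or words).

    prob(unit, prev) is a hypothetical callback returning p(unit | prev).
    """
    total, prev = 0.0, start
    for unit in seq:
        total += math.log(prob(unit, prev))
        prev = unit
    return total


def iwc_logprob(log_p_wbi, log_p_wcb, lambda6):
    """Interpolated word and class bi-gram: log P_wbi + lambda_6 * log P_wcb."""
    return log_p_wbi + lambda6 * log_p_wcb
```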

The probabilities of an n-gram model are usually estimated from a large text corpus using maximum

likelihood estimation with smoothing techniques [38]. Smoothing is used to overcome the problem of

zero probabilities of the unseen n-grams in the training corpus. Many smoothing techniques have been

proposed in the literature [38], but none of them is systematically superior to the others in handwriting

recognition [39, 40]. We use the Katz back-off smoothing method [38, 41], which is the default method in

SRILM [7] toolkit. Back-off means that, when a higher-order n-gram ($\langle \omega_1, \ldots, \omega_n \rangle$) does not appear in the corpus, the probability estimation backs off to a lower-order n-gram calculation. It is formulated by

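$$p(\omega_n|\omega_1^{n-1}) = \begin{cases} p^*(\omega_n|\omega_1^{n-1}), & \text{if } C(\omega_1^n) > 0, \\ \alpha(\omega_1^{n-1}) \cdot p(\omega_n|\omega_2^{n-1}), & \text{if } C(\omega_1^n) = 0, \end{cases} \tag{2}$$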

where $\omega_1^{n-1} = \langle \omega_1, \ldots, \omega_{n-1} \rangle$ denotes the n − 1 history characters (words) of $\omega_n$, C(·) is the number of times the argument occurs in the corpus, p*(·) is the Good-Turing discounted probability [38, 41], and α(·) is the scaling factor that ensures the smoothed probabilities p(·) sum to one.
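The recursion in Eq. (2) can be sketched as a dictionary lookup in Python. This is a minimal illustration, not the SRILM implementation: `probs`, `alphas` and `floor` are hypothetical names, the probability floor for unseen uni-grams is an assumption of the sketch, and an unseen history is assumed to have a scaling factor of 1, as is conventional in ARPA-format back-off models.

```python
def backoff_prob(ngram, probs, alphas, floor=1e-10):
    """Back-off probability p(w_n | w_1 .. w_{n-1}) following Eq. (2).

    ngram:  tuple (w_1, ..., w_n)
    probs:  dict mapping observed n-gram tuples to discounted probabilities p*
    alphas: dict mapping history tuples to back-off scaling factors alpha
    floor:  assumed fallback for unseen uni-grams (not from the paper)
    """
    if ngram in probs:                 # C(w_1^n) > 0: use the discounted estimate
        return probs[ngram]
    if len(ngram) == 1:                # nothing left to back off to
        return floor
    history = ngram[:-1]
    # C(w_1^n) = 0: scale the lower-order estimate p(w_n | w_2 .. w_{n-1})
    return alphas.get(history, 1.0) * backoff_prob(ngram[1:], probs, alphas, floor)
```

For example, with `probs = {("a",): 0.6, ("a", "b"): 0.7}` and `alphas = {("a",): 0.5}`, the unseen bi-gram `("a", "a")` is scored as `0.5 * 0.6`.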

Obviously, an n-gram model has two categories of parameters besides the n-gram table: discounted probabilities p*(·) and scaling factors α(·), and parameters are needed only for the n-grams observed in the corpus. For example, a bi-gram model needs to store only the probabilities of all observed n-grams (i.e., bi-grams and uni-grams) and the scaling factors of all observed uni-grams. However, there are still too many n-gram parameters due to

the large number of Chinese characters (more than 7,000 in our system) and words (about 0.3 million). In

our system, we use the entropy-based pruning algorithm [18] to remove those n-grams whose removal raises the perplexity by less than a threshold, which is set empirically as $5 \times 10^{-8}$, $10^{-7}$ and $10^{-7}$ for cbi, cti

and wbi, respectively [42]. Since the word class number (1,000 in our system) leads to a moderate model

size, the parameters of wcb are not pruned.

In addition, we evaluate the performance of a language model using the perplexity (PP) [41], which is

calculated as the reciprocal of geometric average probability:

$$PP(C) = P(C)^{-\frac{1}{m}} = \frac{1}{\sqrt[m]{\prod_{i=1}^{m} p(c_i|h_i)}}, \tag{3}$$

where C is a measured text containing m characters (words), hi is the history characters (words) of ci, and

$p(c_i|h_i)$ is given by the evaluated language model (e.g., bi-gram: $p(c_i|h_i) = p(c_i|c_{i-1})$). The logarithm of the perplexity equals the negative log-likelihood of the language model on the text C averaged over its m characters (words), so a lower perplexity indicates a better model.
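A minimal Python sketch of Eq. (3) is given below. It works in log space to avoid numerical underflow on long texts; the `cond_prob` callback is a hypothetical interface standing in for whichever n-gram model is being evaluated.

```python
import math


def perplexity(seq, cond_prob):
    """Perplexity PP(C) = P(C)^(-1/m) of Eq. (3), computed in log space.

    cond_prob(i, seq) is a hypothetical callback returning p(c_i | h_i)
    for position i of the sequence.
    """
    m = len(seq)
    log_p = sum(math.log(cond_prob(i, seq)) for i in range(m))
    return math.exp(-log_p / m)
```

A model that assigns probability 0.25 to every unit has perplexity 4, regardless of the text length, which illustrates the length normalization.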

Each n-gram model above (e.g., cbi, cti) can be seen as a discrete probability distribution on all n-grams, which can be represented as a vector with the dimensionality as the number of all n-grams. This

concept of vector representation will be adopted in the following sections.


5. Language Model Adaptation

This section presents three language model adaptation (LMA) methods, which are necessary when the

generic language model (LM) does not match well the handwritten text. Because the domain of handwritten

text is variable and unknown a priori, we use a two-pass recognition strategy for unsupervised adaptation

of language model, which is described as follows:

Two-Pass recognition for unsupervised LMA

(1) Use a generic language model (LM0) to recognize a document to obtain a preliminary transcript C;

(2) Generate an adaptive language model (LM∗) that best matches the preliminary transcript;

(3) Use LM∗ to recognize the document again to obtain the final transcript.

More passes of recognition can be considered, but our experiments show that this does not improve the

recognition performance further. Step 2 plays the key role in the whole process, and we will

describe how to generate the adaptive LM via three methods, namely, model selection, model combination

and model reconstruction.
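The two-pass procedure above can be summarized by the following Python sketch. The `recognize` and `adapt` callables are placeholders for the recognizer and for any of the three adaptation methods; their names and signatures are assumptions of this sketch rather than part of the described system.

```python
def two_pass_recognize(document, recognize, adapt, generic_lm):
    """Two-pass recognition with unsupervised LM adaptation.

    recognize(document, lm) -> transcript   (hypothetical recognizer)
    adapt(transcript)       -> adapted LM   (selection, combination or reconstruction)
    """
    first_pass = recognize(document, generic_lm)   # step 1: preliminary transcript C
    adapted_lm = adapt(first_pass)                 # step 2: adaptive LM matching C
    return recognize(document, adapted_lm)         # step 3: final transcript
```

In a practical system the second call would reuse the stored character candidate lattice rather than repeat character classification.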

5.1. Model Selection

In the two-pass recognition strategy, we can get the transcript of each document after the first-pass

recognition. This transcript is the most direct domain-specific data available, but it is too short (specifically, 200 - 300 characters in each document of our database, and even fewer words) and contains recognition errors. However, it does carry the domain information of the document, and we can

use this information to get a matched language model.

Meanwhile, we can prepare a language model set ({LM1, LM2, . . . , LMK}, K is the number of pre-

defined domains), and each language model (LMk) corresponds to one specific domain (e.g., sport, business,

health), which is trained from a large text corpus (Tk) relevant to this domain. All of these text corpora

can be easily obtained from the Internet (e.g., the resources from the Sogou Labs). Moreover, we have a

language model (LM0) for the general domain, which is also used in the first-pass recognition. Finally, we

have K + 1 language models in total.


Since the perplexity measures the quality of one LM, we can straightforwardly choose the best one with

the minimum perplexity from the pre-defined multi-domain LM set (including the generic LM):

$$k^* = \arg\min_{k} PP_k(C), \quad 0 \le k \le K, \tag{4}$$

where PPk(C) is the perplexity of the k-th language model (LMk) on the first-pass transcript C. According

to the definition of perplexity Eq. (3), this criterion is equivalent to maximizing the log-likelihood. This

method works under the assumption that each document belongs to a single domain that matches well with one pre-defined LM.
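Model selection by Eq. (4) amounts to an argmin over the K + 1 perplexities. A minimal Python sketch, assuming the perplexities on the first-pass transcript have already been computed and that index 0 refers to the generic LM:

```python
def select_model(perplexities):
    """Return k* of Eq. (4): the index of the LM with minimum perplexity.

    perplexities[k] = PP_k(C) for k = 0..K, with k = 0 the generic LM.
    """
    return min(range(len(perplexities)), key=perplexities.__getitem__)
```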

5.2. Model Combination

Model selection is straightforward to select one model from the pre-defined multi-domain LM set, but

such an LM set cannot cover all possible domains in practice. When the domain of handwritten text is

a hybrid of several domains or out of the pre-defined set, this method will fail. To overcome this, we

generate a new adaptive LM via linear combination of the pre-defined multiple LMs:

$$p(\omega_n|\omega_1^{n-1}) = \sum_{k=0}^{K} \beta_k \cdot p_k(\omega_n|\omega_1^{n-1}), \tag{5}$$

where pk(·) is calculated by the k-th language model (LMk), and the parameter βk serves as the combination

weight to control the relative importance of $LM_k$. To reduce the computational cost, we can consider a

reduced number of LMs with lower perplexities while viewing the remaining LMs irrelevant to the test

document. Given the pre-defined multiple LMs, the key issue of model combination is to estimate the

combination weights to match the test document, and we propose three estimation methods in the following.

We first introduce a heuristic method to estimate these weights. The perplexity Eq. (3) denotes the

goodness of the corresponding LM fitting the transcript C, and the lower perplexity means a better model.

Therefore, the weight of one LM should be proportional to the reciprocal of perplexity:

$$\beta_k \propto \frac{1}{PP_k(C)}, \quad k = 0, 1, \ldots, K. \tag{6}$$

In this way, we can get the weights by normalizing the reciprocals of perplexity into unity:

$$\beta_k = \frac{\frac{1}{PP_k(C)}}{\sum_{i=0}^{K} \frac{1}{PP_i(C)}}, \quad k = 0, 1, \ldots, K. \tag{7}$$


Like the model selection method Eq. (4), this method only needs to calculate the perplexity of each LM,

which is fast to implement. Actually, it is similar to the weights estimation method via string probability

in [25], where P(C) was used instead of $\frac{1}{PP(C)}$ in Eq. (7). Unlike the perplexity PP(C), the value of P(C)

(see Table 1) is not normalized with respect to the length of sequence, and thus is very sensitive to the string

length.
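The heuristic weights of Eq. (7) and the linear combination of Eq. (5) can be sketched in a few lines of Python. The function names are illustrative, and the component probabilities passed to `combined_prob` are assumed to have been looked up in the individual LMs beforehand.

```python
def perplexity_weights(perplexities):
    """Eq. (7): beta_k proportional to 1 / PP_k(C), normalized to sum to one."""
    inv = [1.0 / pp for pp in perplexities]
    total = sum(inv)
    return [v / total for v in inv]


def combined_prob(weights, component_probs):
    """Eq. (5): sum_k beta_k * p_k(w_n | history) for a single n-gram."""
    return sum(b * p for b, p in zip(weights, component_probs))
```

For perplexities 100, 50 and 200 the weights are 2/7, 4/7 and 1/7, so the best-fitting LM dominates without the others being discarded.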

Alternatively, it is possible to learn these weights using machine learning algorithms by viewing this

problem as a linear regression model. Given the first-pass transcript C, we can extract all the n-grams in C to form a set of training samples. Each sample is an n-gram ($\langle \omega_1, \ldots, \omega_n \rangle$), and its features are the corresponding K + 1 LM probabilities ($x = (x_0, x_1, \ldots, x_K)^T$, where $x_k = p_k(\omega_n|\omega_1^{n-1})$ is given by the k-th LM, $k = 0, 1, \ldots, K$). For each sample, we set the target probability of the n-gram as t = 1, in the hope that this n-gram will be chosen in the second-pass recognition (the path evaluation function tends to choose the n-gram with higher probability). Meanwhile, an estimated value is given by $y = \sum_{k=0}^{K} \beta_k x_k = b^T x$ according to Eq. (5), where $b = (\beta_0, \beta_1, \ldots, \beta_K)^T$. On obtaining the training samples of the test document, the weights can be learned by minimizing the sum of squared error (MSE):

$$\min_b F(b) = \sum_{i=1}^{N} (t_i - b^T x_i)^2, \tag{8}$$

where N is the sample number, and the vector $x_i = (x_{i0}, x_{i1}, \ldots, x_{iK})^T$ denotes the features of the i-th sample.

To alleviate the over-fitting, a regularization term is usually added in Eq. (8) to constrain the parameters,

leading to a modified error function:

$$\min_b F(b) = \sum_{i=1}^{N} (t_i - b^T x_i)^2 + \lambda \cdot \|b\|_2^2, \tag{9}$$

where the hyper-parameter λ (λ ≥ 0) governs the relative importance of the regularization term. In this

formulation, the computation of b is a quadratic programming problem, and has a closed-form solution:

$$b = \Big( \sum_{i=1}^{N} x_i x_i^T + \lambda I \Big)^{-1} \sum_{i=1}^{N} t_i x_i, \tag{10}$$

where I is an identity matrix. For the tradeoff parameter λ, considering the influence of data scaling, we suggest setting λ as

$$\lambda = \frac{\tilde{\lambda}}{K+1} \sum_{j=1}^{K+1} |M_{jj}|, \tag{11}$$

where $M = \sum_{i=1}^{N} x_i x_i^T$, and $\tilde{\lambda}$ is selected on a validation data set.
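The closed-form solution of Eq. (10) with the scaled regularizer of Eq. (11) can be sketched with NumPy as follows. The function name and the layout of `X` (one row per n-gram sample, one column per component LM) are assumptions of this sketch; targets default to t = 1 as described above.

```python
import numpy as np


def ridge_weights(X, lam_tilde, t=None):
    """Combination weights b from Eqs. (10)-(11).

    X:         (N, K+1) array; row i holds the K+1 LM probabilities of sample i
    lam_tilde: relative regularization strength, chosen on validation data
    t:         optional targets; all ones when omitted
    """
    n, k1 = X.shape
    t = np.ones(n) if t is None else np.asarray(t, dtype=float)
    M = X.T @ X                                        # sum_i x_i x_i^T
    lam = lam_tilde / k1 * np.abs(np.diag(M)).sum()    # Eq. (11)
    # Solve (M + lam * I) b = sum_i t_i x_i instead of forming the inverse
    return np.linalg.solve(M + lam * np.eye(k1), X.T @ t)
```

Solving the linear system rather than inverting the matrix explicitly is a common numerical choice and gives the same b as Eq. (10).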

The regularization term in Eq. (9) is the L2-norm penalty, and such model is also known as ridge

regression. Alternatively, we also evaluate the L1-norm regularization for model sparsity:

$$\min_b F(b) = \sum_{i=1}^{N} (t_i - b^T x_i)^2 + \lambda \cdot \|b\|_1. \tag{12}$$

This model is also known as Lasso regression. It can be solved by the coordinate-wise descent algo-

rithm [43]:

$$\beta_k(\lambda) \leftarrow S\Big( \beta_k(\lambda) + \sum_{i=1}^{N} x_{ik}(t_i - y_i),\ \lambda \Big), \tag{13}$$

where $y_i = b^T x_i$, and S(·) is a soft-thresholding function defined by

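$$S(\tilde{\beta}, \lambda) = \begin{cases} \tilde{\beta} - \lambda, & \text{if } \tilde{\beta} > 0 \text{ and } \lambda < |\tilde{\beta}|, \\ \tilde{\beta} + \lambda, & \text{if } \tilde{\beta} < 0 \text{ and } \lambda < |\tilde{\beta}|, \\ 0, & \text{if } \lambda \ge |\tilde{\beta}|. \end{cases} \tag{14}$$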

The update Eq. (13) is repeated for k = 0, 1, . . . ,K, 0, 1, . . . until convergence, and each update is very

quick in our experiments, because the sample number (N, i.e., the number of n-grams) is usually small

in one document. Compared to the L2-norm regularization, the L1-norm usually yields a sparse solution

(some learned weights are zero). This means that fewer language models are combined, which makes the

system faster.
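The coordinate-wise update can be sketched as follows. This is a minimal illustration and not the authors' code: it implements the general cyclic coordinate descent for the objective in Eq. (12), dividing by the squared norm of each feature column and thresholding at λ/2. When every feature column has unit norm, the update reduces to the form of Eq. (13), up to a constant factor absorbed into λ. The toy data and all names are assumptions.

```python
import numpy as np

def soft_threshold(beta, lam):
    """Soft-thresholding function of Eq. (14)."""
    if lam >= abs(beta):
        return 0.0
    return beta - lam if beta > 0 else beta + lam

def lasso_coordinate_descent(X, t, lam, max_sweeps=1000, tol=1e-12):
    """Cyclic coordinate descent for sum_i (t_i - b^T x_i)^2 + lam * ||b||_1."""
    n_features = X.shape[1]
    b = np.zeros(n_features)
    z = (X ** 2).sum(axis=0)              # squared norm of each feature column
    for _ in range(max_sweeps):
        largest_change = 0.0
        for k in range(n_features):
            if z[k] == 0.0:
                continue                  # a constant-zero feature keeps weight 0
            residual = t - X @ b          # t_i - y_i with the current weights
            rho = z[k] * b[k] + X[:, k] @ residual
            new_beta = soft_threshold(rho, lam / 2.0) / z[k]
            largest_change = max(largest_change, abs(new_beta - b[k]))
            b[k] = new_beta
        if largest_change < tol:
            break
    return b

# Toy data (assumed): 60 samples, 5 features, two of which carry the signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5))
t = X @ np.array([1.0, 0.0, 0.0, 0.5, 0.0]) + 0.01 * rng.standard_normal(60)
b_sparse = lasso_coordinate_descent(X, t, lam=5.0)
```

A weight that ends at exactly zero corresponds to a language model that is dropped from the combination, which is the source of the speed-up discussed above.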

5.3. Model Reconstruction

Viewing one pre-defined LM as a sample, the above methods of LM adaptation by either selecting one

sample or combining several samples can be seen as a joint representation for the LM set (sample set). In

this section, we propose another parametric form using a group of orthogonal bases to reconstruct a LM:

$$s = \mu + U_r v_r. \tag{15}$$

This idea is motivated by the work [22] on representing images of human faces, and it has been successfully used in other areas such as active shape models [23].

In the above, the term s ∈ ℜd is a reconstructed sample, denoting a new LM (adaptive LM), with each

element representing the probability value of an n-gram, and the dimensionality d is the size of the n-gram

list shared in the orthogonal space. The term $\mu = \frac{1}{N_s} \sum_{i=1}^{N_s} s_i$ is the sample mean, and Ns is the sample

number (herein, the number of pre-defined LMs, i.e., Ns = K + 1). The r columns of matrix Ur ∈ ℜd×r

are the orthogonal bases obtained by applying principal component analysis (PCA) to the sample set, and

vr ∈ ℜr denotes the coefficients of the sample’s projection on the orthogonal space.

The computation of the orthogonal bases follows conventional PCA; that is, the r columns of matrix Ur are the first r eigenvectors of the sample covariance matrix $\Sigma = \frac{1}{K+1} \bar{S} \bar{S}^T$, where $\bar{S} = [s_0 - \mu, \ldots, s_K - \mu]$. One might wonder how to efficiently obtain the eigenvectors of the large-scale matrix $\bar{S} \bar{S}^T \in \Re^{d \times d}$ (d is on the order of millions). Fortunately, the matrix $\bar{S}^T \bar{S} \in \Re^{(K+1) \times (K+1)}$ is very small (its dimensionality is on the order of dozens), and its eigenvectors can be calculated quickly. According to the following equations (16)-(18), the eigenvector x of $\bar{S}^T \bar{S}$ corresponding to eigenvalue ξ can be easily transformed to the required eigenvector u of $\bar{S} \bar{S}^T$ by $u = \frac{1}{\sqrt{\xi}} \bar{S} x$.

$$\bar{S}^T \bar{S} x = \xi x, \tag{16}$$

$$\bar{S} (\bar{S}^T \bar{S}) x = \bar{S} (\xi x), \tag{17}$$

$$(\bar{S} \bar{S}^T)(\bar{S} x) = \xi (\bar{S} x). \tag{18}$$
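The identity in Eqs. (16)-(18) can be checked numerically with a short NumPy sketch. The function name, the random toy matrix, and its sizes are assumptions for illustration; the only point is that the d × d matrix is never formed.

```python
import numpy as np

def pca_bases_via_small_matrix(S, r):
    """Return (mu, U_r, xi) using the (K+1) x (K+1) matrix S_bar^T S_bar.

    S is a d x (K+1) matrix with one sample per column. The centered data
    has rank at most K, so r should not exceed K.
    """
    mu = S.mean(axis=1)
    S_bar = S - mu[:, None]
    small = S_bar.T @ S_bar                   # (K+1) x (K+1), cheap to decompose
    xi, X = np.linalg.eigh(small)             # eigenvalues in ascending order
    order = np.argsort(xi)[::-1][:r]          # keep the r largest
    xi, X = xi[order], X[:, order]
    U_r = (S_bar @ X) / np.sqrt(xi)           # u = S_bar x / sqrt(xi), column by column
    return mu, U_r, xi

# Toy data (assumed): d = 1000 shared n-grams, K + 1 = 6 samples.
rng = np.random.default_rng(0)
S = rng.random((1000, 6))
mu, U_r, xi = pca_bases_via_small_matrix(S, r=3)
```

The division by the square root of ξ normalizes each column of U_r to unit length, so the returned bases are orthonormal.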

Now we focus on how to construct a new sample (i.e., adaptive LM) in such orthogonal space to match

the test document. Like the linear regression model in Section 5.2, we get adaptive coefficients vr by

minimizing the sum of squared error with L2-norm regularization:

$$\min_{v_r} F(v_r) = \sum_{i=1}^{N} \left[ t_i - (\mu_{\pi(i)} + u_{\pi(i)}^T v_r) \right]^2 + \lambda \cdot \|v_r - v_0\|_2^2, \tag{19}$$

where N is the number of n-grams in the first-pass transcript C, and ti = 1 is the target value for the i-th n-gram. The subscript π(i) is the index of the i-th n-gram of C in the shared n-gram list³, and $u_{\pi(i)}^T$ is the π(i)-th row of the bases matrix Ur. In the regularization term, v0 represents the projection of the sample s0 onto the orthogonal space: $v_0 = U_r^T (s_0 - \mu)$, where s0 is the language model chosen by the model selection method. This means that the reconstructed LM is attracted to the vicinity of the model selection result.

³ If this n-gram is not in the list, it is not used, so the number N is usually smaller than the length of transcript C here.

Similar to Eq. (10), we also have a closed-form solution:

$$v_r = \left( \sum_{i=1}^{N} u_{\pi(i)} u_{\pi(i)}^T + \lambda I \right)^{-1} \left( \sum_{i=1}^{N} (t_i - \mu_{\pi(i)}) u_{\pi(i)} + \lambda v_0 \right), \tag{20}$$

and the tradeoff parameter λ is also adjusted by considering the scaling of different handwritten texts:

$$\lambda = \frac{\tilde{\lambda}}{r} \sum_{j=1}^{r} |M_{jj}|, \tag{21}$$

where $M = \sum_{i=1}^{N} u_{\pi(i)} u_{\pi(i)}^T$, and λ̃ is selected on a validation data set. Since there is no sparse property in the coefficients in vr, we do not try L1-norm regularization here.
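A minimal NumPy sketch of Eqs. (20) and (21) is given below. The orthonormal bases, the mean vector, the n-gram indices, and the stand-in for the model-selection result s0 are all random placeholders assumed for illustration; only the algebra follows the equations above.

```python
import numpy as np

def reconstruction_coefficients(U_r, mu, idx, v0, lam_tilde):
    """Closed-form v_r of Eq. (20) with target t_i = 1 for every observed n-gram.

    idx holds pi(i), the position of each observed n-gram in the shared list.
    """
    U = U_r[idx]                                     # N x r, row i is u_{pi(i)}^T
    M = U.T @ U                                      # sum_i u_{pi(i)} u_{pi(i)}^T
    lam = lam_tilde * np.mean(np.abs(np.diag(M)))    # Eq. (21)
    rhs = U.T @ (1.0 - mu[idx]) + lam * v0
    return np.linalg.solve(M + lam * np.eye(U.shape[1]), rhs)

# Toy setting (assumed): d = 200 shared n-grams, r = 3 bases, N = 50 observed n-grams.
rng = np.random.default_rng(0)
d, r, N = 200, 3, 50
U_r, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal columns
mu = 0.01 * rng.random(d)
s0 = mu + 0.01 * (U_r @ rng.standard_normal(r))      # hypothetical model-selection LM
v0 = U_r.T @ (s0 - mu)
idx = rng.integers(0, d, size=N)
v_r = reconstruction_coefficients(U_r, mu, idx, v0, lam_tilde=0.5)
s_adapted = mu + U_r @ v_r                           # adaptive LM, Eq. (15)
```

As λ̃ grows, the solution moves toward v0, i.e., the reconstructed LM falls back to the model selection result.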

Compared to the above LMA methods of either model selection or model combination, model reconstruction by the orthogonal bases can generate a more flexible LM. However, it needs a longer handwritten text to estimate the optimized coefficients for an adaptive LM, and is limited to the shared n-gram list.

6. Language Model Compression

The above LMA methods depend on a LM set including K+1 LMs, which poses a challenge of storage.

Although we pruned each LM to a moderate size using entropy-pruning [18], the storage size of K + 1

models is still considerable for practical applications. Each LM contains two parts, namely, the n-gram

table and the probability values of each n-gram. The n-gram table is not considered in the following,

because it is fixed in every LM. In this section, we introduce two methods to compress the storage size of

probability values without removing any n-gram. The first method uses split vector quantization (SVQ) to

compress each LM separately, while the other one uses PCA to compress the whole LM set jointly.

6.1. SVQ Compression

In each language model, there are many similar probability values, such as the probabilities of similar

n-grams. These probabilities can be clustered to a group of prototypes with little loss of precision. This

motivates us to use the SVQ technique [36] to compress each LM. Figure 4 shows the diagram of this

method for one vector formed by the probability values of all n-grams.

Figure 4: Diagram of split vector quantization (SVQ) compression.

According to the back-off representation of an n-gram model Eq. (2), these parameters can be repre-

sented by a vector set ϕ = {p∗n,p∗n−1,αn−1, . . . ,p∗1,α1}, where each vector represents one group of parameters

(e.g., p∗n−1 represents probabilities of (n − 1)-grams and αn−1 represents scaling factors of (n − 1)-grams),

and can be compressed by SVQ. Consider the SVQ compression of the vector p∗n as an example; the compression of the other vectors is similar. We first split the original high-dimensional vector into low-dimensional sub-vectors, i.e., the original d-dimensional vector (p∗n) is equally partitioned into Q sub-vectors of dimensionality dQ (p∗n1, p∗n2, . . . , p∗nQ), where d = Q × dQ.⁴ Then, these sub-vectors are clustered into a

small set of L prototypes by the k-means clustering algorithm to form a codebook. The codebook as well

as the corresponding indices of prototypes for all sub-vectors are stored for LM reconstruction. We can see

that the data quantization method in [19] is a special case of the SVQ compression by setting dQ = 1.

During the recognition process, the probability of one n-gram in a sub-vector is mapped by the corre-

sponding element of the prototype according to the index of the sub-vector. Lower mapping error can be

ensured by using more prototypes (L) and lower dimensionality of the subspace (dQ), which will lead to a

larger storage size, however. In our experiments, we found that dQ = 2 and L = 256 for the compression of

each LM leads to a good tradeoff between the size and recognition performance.
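The SVQ procedure described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: the plain Lloyd-style k-means, the random initialization, the zero padding, and the random "probability" vector are all assumptions.

```python
import numpy as np

def svq_compress(p, d_q=2, n_protos=256, n_iter=20, seed=0):
    """Split p into d_q-dimensional sub-vectors and cluster them with k-means."""
    rng = np.random.default_rng(seed)
    pad = (-len(p)) % d_q                        # dummy elements, as in footnote 4
    subs = np.concatenate([p, np.zeros(pad)]).reshape(-1, d_q)
    n_protos = min(n_protos, len(subs))
    start = rng.choice(len(subs), size=n_protos, replace=False)
    codebook = subs[start].copy()
    for _ in range(n_iter):
        dist = ((subs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        idx = dist.argmin(axis=1)
        for j in range(n_protos):
            members = subs[idx == j]
            if len(members):                     # an empty cluster keeps its prototype
                codebook[j] = members.mean(axis=0)
    dist = ((subs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook, dist.argmin(axis=1)

def svq_lookup(codebook, idx, i):
    """Value of the i-th original element, read from its sub-vector's prototype."""
    d_q = codebook.shape[1]
    return codebook[idx[i // d_q], i % d_q]

def svq_decompress(codebook, idx, n):
    """Rebuild an approximation of the original n-dimensional vector."""
    return codebook[idx].reshape(-1)[:n]

# Toy vector (assumed): 4001 values, so one dummy element is padded for d_q = 2.
p = np.random.default_rng(1).random(4001)
codebook, indices = svq_compress(p, d_q=2, n_protos=256)
restored = svq_decompress(codebook, indices, len(p))
```

With L = 256 prototypes, each sub-vector index fits in one byte, so the stored data are the small codebook plus one byte per sub-vector.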

⁴ When d is not an integer multiple of dQ, dummy elements can be added to make it one.

6.2. PCA Compression

We represent multiple LMs as a data matrix S = [s0, s1, . . . , sK], where the k-th column (sk ∈ ℜd) represents the n-gram probabilities from the k-th LM on a shared n-gram list. The shared list is constructed from all text corpora, and its size is denoted by d (i.e., the number of all n-grams). Typically, K and d are on the order of dozens and millions, respectively. In such a matrix, there are many correlated

elements, e.g., the values of common n-grams in different LMs or similar n-grams in one LM. We use the

PCA technique to remove such correlations and compress the redundancy. Figure 5 shows the diagram of

the PCA compression.

Figure 5: Diagram of principal component analysis (PCA) compression.

A data vector s can be projected onto an orthogonal space of r-dimensionality by PCA:

$$v_r = U_r^T (s - \mu), \tag{22}$$

where the projected vector vr denotes the coefficients on the first r principal components, µ is the sample mean vector, and Ur is the same bases matrix as in Section 5.3. Given the coefficient vector vr, the original data vector can be easily approximated according to Eq. (15). Finally, all the K + 1 samples can be approximated by the corresponding K + 1 coefficient vectors, r basis vectors and one sample mean vector, as illustrated in Fig. 5.
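The storage layout implied by Eqs. (15) and (22) can be illustrated with the following NumPy sketch. For brevity it obtains the bases from a thin SVD of the centered data, whose left singular vectors are the eigenvectors of the matrix used in Section 5.3; the function names and the random toy matrix are assumptions.

```python
import numpy as np

def pca_compress(S, r):
    """Store one mean vector, r basis vectors, and one coefficient vector per sample."""
    mu = S.mean(axis=1)
    S_bar = S - mu[:, None]
    U, _, _ = np.linalg.svd(S_bar, full_matrices=False)
    U_r = U[:, :r]
    V = U_r.T @ S_bar              # column k is v_r of Eq. (22) for sample s_k
    return mu, U_r, V

def pca_decompress(mu, U_r, V):
    """Approximate every sample by Eq. (15): s = mu + U_r v_r."""
    return mu[:, None] + U_r @ V

# Toy set (assumed): d = 500 shared n-grams, K + 1 = 16 samples.
rng = np.random.default_rng(0)
d, n_models = 500, 16
S = rng.random((d, n_models))
mu, U_r, V = pca_compress(S, r=3)
stored_values = mu.size + U_r.size + V.size     # d + r*d + r*(K+1)
original_values = S.size                        # (K+1)*d
```

Since d is far larger than K + 1, the stored size is dominated by the (r + 1) vectors of length d, compared with K + 1 such vectors for the uncompressed set.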

Note that this PCA compression does not work for word class bi-gram (wcb) models, because it is impossible

to construct the shared list of word class bi-gram from all text corpora (the class index represents different

words in different wcb models).

Moreover, we can compress each vector (e.g., mean vector, eigenvector) further by the SVQ method,

which gives a combination of PCA and SVQ compression to produce a much smaller storage size.

7. Experimental Results

We evaluated the performance of our unsupervised LMA approaches on two databases: a large database of unconstrained Chinese handwriting, CASIA-HWDB [20], and a small data set, HIT-MW [21], both of which are free to download for research⁵. All the experiments were run on a desktop computer with a 3.10 GHz CPU, and the programs were implemented in Microsoft Visual C++ 2005.

7.1. Database and Experimental Setting

The CASIA-HWDB database contains both isolated characters and unconstrained handwritten texts, and

is divided into a training set of 816 writers and a test set of 204 writers. The training set contains 3,118,447

isolated character samples of 7,356 classes and 4,076 pages of handwritten texts (including 1,080,017 character samples). We tested on the 1,015 handwritten pages in the test set, which were segmented into 10,449 text lines containing 268,629 characters. That is to say, each page contains about 265 characters on average.

The HIT-MW data set has a test set of 383 text line images containing 8,448 characters (on average, 22

characters in each text line), and each text line is treated as a handwritten text page in this set.

A typical HCTR system includes a character classifier, a geometric model and a language model, together with the combination weights of these models. In our system, we use a modified quadratic discriminant function (MQDF) [44] as the character classifier on the normalization-cooperated gradient features (NCGF) [45] extracted from each gray-scale character image. The parameters of MQDF were learned from 4/5 of the training samples (including the isolated samples and the character samples segmented from

the text pages), and the remaining 1/5 training samples were used for confidence parameter estimation. For

parameter estimation of the geometric models, we extracted geometric features from all the 41,781 text lines

of training text pages. The generic language models were trained on a large general text corpus containing

about 50.8 million characters (about 32.7 million words) from the Chinese Linguistic Data Consortium.

On obtaining such context models, the combining weights were learned on 300 training text pages. These

settings are the same as those in our previous work [4]. Another 200 training text pages were used as the held-out data to set the hyper-parameters, e.g., the regularization term weight λ̃ in Eq. (11) and the number of principal components r in PCA. The obtained parameter values are listed in Table 2, where the value in the parentheses of the last row is the weight λ6 in the iwc model (see Table 1). Because PCA does not work for wcb, as explained in Section 6.2, its values are denoted as 'N/A' in Table 2.

⁵ http://www.nlpr.ia.ac.cn/databases/handwriting/Home.html and http://code.google.com/p/hit-mw-database/.

Table 2: Parameter values obtained by MCA training [4] or cross-validation.

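Columns λ1 to λ5 are the context weights in Eq. (1), the λ̃ columns are the regularization term weights, and r is the number of principal components.

| model | λ1 | λ2 | λ3 | λ4 | λ5 | λ̃ (L2) | λ̃ (L1) | λ̃ (PCA) | r |
|-------|----|----|----|----|----|---------|---------|----------|---|
| cbi | 0.021 | 0.088 | 0.151 | 0.073 | 0.184 | 1.0 | 0.10 | 15 | 6 |
| cti | 0.021 | 0.084 | 0.147 | 0.084 | 0.186 | 0.3 | 0.10 | 10 | 3 |
| wbi | 0.031 | 0.011 | 0.178 | 0.160 | 0.163 | 1.0 | 0.15 | 10 | 3 |
| wcb | 0.030 | 0.010 | 0.170 | 0.163 | 0.154 | 4.0 | 0.70 | N/A | N/A |
| iwc | 0.034 | 0.013 | 0.177 | 0.127 | 0.085 (0.083) | 1.0 | 0.15 | 10 | 3 |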

To prepare a pre-defined set of LMs to match different handwritten texts, we extracted 14 corpora of

various domains from the web pages provided by the Sogou Labs. All the texts were segmented into word

sequences using the ICTCLAS⁶ toolkit for word-level LMs, and further, we clustered the words into a

number of word classes using the algorithm of [46]. In addition, an ancient domain corpus (without word

segmentation due to the unavailability of ancient domain word lexicon) was collected from the Internet.

Table 3 shows the statistics of characters, words, character classes, word classes and word clusters in each

corpus. We can see that the corpus of news domain is the largest, which includes about 418 million charac-

ters and 265 million words, and is even much larger than the general one. On the other hand, the texts of

ancient domain are much fewer, but about 8.22 million characters are enough to get an appropriate character

bi-gram or tri-gram model using the SRILM [7] toolkit.

We evaluate the recognition performance using two character-level accuracy metrics as in our previous

paper [4]: Correct Rate (CR) and Accurate Rate (AR):

$$\mathrm{CR} = (N_t - D_e - S_e)/N_t, \qquad \mathrm{AR} = (N_t - D_e - S_e - I_e)/N_t, \tag{23}$$

where Nt is the total number of characters in the ground truth transcript. The numbers of substitution errors (Se), deletion errors (De) and insertion errors (Ie) are calculated by aligning the recognized result string with the ground truth transcript by dynamic programming. Vinciarelli et al. [39] suggested that the AR (called recognition rate there) is an appropriate measure for document transcription, while CR (called accuracy rate there) is a good metric for tasks of content modeling (e.g., document retrieval).

⁶ Institute of Computing Technology, Chinese Lexical Analysis System: http://ictclas.org/

Table 3: Statistics of characters, words, character classes, word classes and word clusters in each corpus.

| domains | LMs | characters (million) | words (million) | character classes | word classes | word clusters |
|---------|-----|----------------------|-----------------|-------------------|--------------|---------------|
| general | LM0 | 50.8 | 32.7 | 7356 | 281,680 | 1000 |
| news | LM1 | 418 | 265 | 6699 | 454,370 | 1000 |
| business | LM2 | 333 | 202 | 6474 | 473,792 | 1000 |
| sport | LM3 | 227 | 149 | 6789 | 234,130 | 1000 |
| house | LM4 | 118 | 73.4 | 6231 | 254,659 | 1000 |
| entertain | LM5 | 106 | 71.5 | 5882 | 144,246 | 500 |
| it | LM6 | 54.1 | 33.1 | 5628 | 156,728 | 500 |
| Olympic | LM7 | 52.2 | 33.0 | 6048 | 144,390 | 500 |
| women | LM8 | 44.0 | 29.5 | 5569 | 94,409 | 350 |
| auto | LM9 | 32.4 | 20.4 | 5153 | 105,487 | 500 |
| travel | LM10 | 31.4 | 20.1 | 5755 | 133,731 | 500 |
| health | LM11 | 31.2 | 20.2 | 5590 | 99,207 | 350 |
| learning | LM12 | 28.0 | 17.5 | 5548 | 104,282 | 350 |
| culture | LM13 | 20.0 | 13.4 | 5791 | 104,162 | 350 |
| military | LM14 | 15.3 | 9.47 | 4854 | 69,683 | 250 |
| ancient | LM15 | 8.22 | - | 7318 | - | - |
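The CR and AR metrics of Eq. (23) can be computed with a standard edit-distance alignment, sketched below in plain Python. The tie-breaking among equally cheap alignments is an assumption of this sketch (ties are resolved by tuple comparison), since the alignment details are not specified here.

```python
def count_errors(truth, recog):
    """Return (substitutions, deletions, insertions) of one minimum-cost alignment."""
    n, m = len(truth), len(recog)
    # cost[i][j] = (total errors, S, D, I) for truth[:i] aligned with recog[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)            # all truth characters deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)            # all recognized characters inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            total, s, d, ins = cost[i - 1][j - 1]
            if truth[i - 1] == recog[j - 1]:
                match = (total, s, d, ins)
            else:
                match = (total + 1, s + 1, d, ins)
            total, s, d, ins = cost[i - 1][j]
            deletion = (total + 1, s, d + 1, ins)
            total, s, d, ins = cost[i][j - 1]
            insertion = (total + 1, s, d, ins + 1)
            cost[i][j] = min(match, deletion, insertion)   # total cost compared first
    _, s, d, ins = cost[n][m]
    return s, d, ins

def recognition_rates(truth, recog):
    """Correct Rate and Accurate Rate of Eq. (23)."""
    s, d, ins = count_errors(truth, recog)
    nt = len(truth)
    return (nt - d - s) / nt, (nt - d - s - ins) / nt

# Toy strings (assumed): one substitution and one insertion.
cr, ar = recognition_rates("abcdef", "abxdefg")
```

Insertions lower AR but not CR, which is why AR is never larger than CR in the tables of this section.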

7.2. Baseline System Performance

To show the effectiveness of unsupervised LMA in HCTR, we first give the results of our baseline HCTR

system which does not use any LMA techniques. In this system, the maximum number of concatenated

segments, candidate number of character classification and beam width in the refined beam search algorithm

(without CCA: candidate character augmentation) are set as 4, 20 and 10, respectively, and the results

of our beam search method were shown to be very close to the optimal solution guaranteed by dynamic

programming under the baseline path evaluation function [4]. To show the importance of various domain

texts, we also trained a generic LM from the union of all text corpora in Table 3, besides that only from the

general text corpus, and the recognition results of all LM types on the CASIA-HWDB test set are shown in

Table 4.

Table 4: Recognition results of baseline system without LMA on CASIA-HWDB test set.

| LMs | General corpus: CR(%) | General corpus: AR(%) | General corpus: Time(h) | All corpora: CR(%) | All corpora: AR(%) | All corpora: Time(h) |
|-----|------|------|------|------|------|------|
| w/o | 79.43 | 77.34 | 0.04 (9.72) | | | |
| cbi | 90.28 | 89.57 | 0.08 | 90.52 | 89.78 | 0.08 |
| cti | 90.81 | 90.21 | 0.14 | 91.28 | 90.66 | 0.15 |
| wbi | 90.99 | 90.34 | 0.62 | 91.42 | 90.76 | 0.65 |
| wcb | 90.81 | 90.11 | 0.60 | 91.27 | 90.57 | 0.65 |
| iwc | 91.22 | 90.58 | 0.65 | 91.63 | 90.98 | 0.67 |

From Table 4, we can observe three points. First, the LM plays an important role in the HCTR system, improving the recognition accuracy by more than 10 percent ("w/o" denotes the results without any LM). Second, the recognition time on the 1,015 test pages drops drastically from 9.72h to 0.04h when the time for over-segmentation and character classification is excluded, indicating that our path search process is very efficient. Hence, by storing all character classification results (about 200KB memory) after the first-pass recognition, our two-pass LMA framework incurs little additional computational cost. To show the effects of different language models, we report only the path search time in the following experiments. Last, the LM trained from the union of all text corpora yields higher recognition accuracy than that trained only from the general corpus, though it is much larger due to the increased number of n-grams. In the following, the baseline results refer to those of the generic LM trained only from the general text corpus by default. We will see that LMA leads to higher performance than the generic models from both the general text corpus and the union of all corpora.

7.3. Results of Language Model Adaptation

In this section, we first show the results of three unsupervised LMA methods of model selection, model

combination and model reconstruction, evaluated on the CASIA-HWDB test set. Because word-level LMs

are not available for ancient domain, we use the adaptive cti model instead of wbi, wcb and iwc model in

the second-pass recognition. In the end, we also report results on the HIT-MW test set using selected LMA

methods.

7.3.1. Results of LMA using Model Selection

Table 5 shows the results of LMA using model selection. Compared to the baseline results in Table 4,

both CR and AR are improved by the model selection adaptation for all LM types, and the improvement of

cti is the largest (about 1.1 percent up). Even compared to the baseline results of the generic LM trained

from all text corpora, the improvement by model selection adaptation is impressive. This demonstrates

the importance of domain-matched LM in the recognition of variable documents. On the other hand, the

processing time (path search part) is about doubled due to the two-pass recognition strategy.

Table 5: Results of LMA using model selection on CASIA-HWDB test set.

| LMs | CR(%) | AR(%) | Time(h) |
|-----|-------|-------|---------|
| cbi | 91.26 | 90.59 | 0.17 |
| cti | 91.92 | 91.37 | 0.26 |
| wbi | 91.91 | 91.3 | 1.03 |
| wcb | 91.73 | 91.06 | 1.14 |
| iwc | 92.14 | 91.53 | 1.21 |

7.3.2. Results of LMA using Model Combination

In Section 5.2, we introduced three methods to estimate the weights of model combination for LMA,

namely, heuristic method, minimizing squared error (MSE) with L2-norm and L1-norm regularization. The

recognition results on CASIA-HWDB test set are shown in Table 6.

Table 6: Results of LMA using model combination with all LMs on CASIA-HWDB test set.

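| LMs | Heuristic: CR(%) | Heuristic: AR(%) | Heuristic: Time(h) | MSE L2-norm: CR(%) | MSE L2-norm: AR(%) | MSE L2-norm: Time(h) | MSE L1-norm: CR(%) | MSE L1-norm: AR(%) | MSE L1-norm: Time(h) |
|-----|------|------|------|------|------|------|------|------|------|
| cbi | 91.33 | 90.63 | 1.01 | 91.68 | 90.78 | 0.99 | 91.63 | 90.71 | 0.33 |
| cti | 92.20 | 91.62 | 2.50 | 92.42 | 91.72 | 2.53 | 92.32 | 91.64 | 0.80 |
| wbi | 92.23 | 91.63 | 1.43 | 92.18 | 91.40 | 1.46 | 92.11 | 91.36 | 1.21 |
| wcb | 91.82 | 91.12 | 1.54 | 91.70 | 90.82 | 1.68 | 91.67 | 90.89 | 1.30 |
| iwc | 92.28 | 91.67 | 2.01 | 92.22 | 91.41 | 2.01 | 92.24 | 91.45 | 1.54 |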

First, compared to the results of model selection adaptation in Table 5, we can see that the recognition

accuracy is improved further by model combination, especially by the MSE method with L2-norm regularization for character-level LMs (i.e., cbi and cti). This demonstrates the benefits of the proposed model combination method, which improves CR from the baseline 90.81% to 92.42% and AR from 90.21% to 91.72% for cti.

Next, we compare the results of three weight learning methods for model combination. Table 6 shows

that the results of MSE methods are better than those of heuristic method for character-level LMs (i.e.,

cbi and cti), whereas for word-level LMs (i.e., wbi, wcb and iwc), the heuristic method performs slightly

better. This is because the character sequence of the first-pass recognition output is more reliable than word

sequence with possible word segmentation errors, and there are usually more characters than words in the

recognized text. Thus, learning weights from such short recognized text for character-level LMs is more

robust than that for word-level LMs. Moreover, we can see that the results of two regularization methods in

MSE combination are comparable, while the L1-norm regularization shows the benefit of less processing

time because it selects fewer LMs to combine, justifying the model sparsity property.

Finally, to speed up the MSE with L2-norm regularization, we also evaluate the performance of combin-

ing fewer LMs, which are selected according to the minimum perplexity criterion Eq. (4). Figure 6 shows

the results of different numbers of LMs used in the model combination by MSE with L2-norm for all LM

types. We can see that combining three LMs gives a good tradeoff between the recognition accuracy and

speed. Further, we show the results of combining three selected LMs using both the heuristic method and

the MSE with L2-norm in Table 7. We can see that the recognition accuracies of combining three selected

LMs are comparable to those of combining all LMs in Table 6, while the processing time (path search part)

is reduced significantly, especially for the character-level LMs (reduced from 2.53h to 0.49h for cti).

Figure 6: The results of combining different numbers of LMs by the MSE with L2-norm regularization on CASIA-HWDB test set.

Table 7: Results of LMA using model combination with only three selected LMs on CASIA-HWDB test set.

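| LMs | Heuristic: CR(%) | Heuristic: AR(%) | Heuristic: Time(h) | MSE L2-norm: CR(%) | MSE L2-norm: AR(%) | MSE L2-norm: Time(h) |
|-----|------|------|------|------|------|------|
| cbi | 91.48 | 90.81 | 0.26 | 91.69 | 90.85 | 0.26 |
| cti | 92.27 | 91.71 | 0.51 | 92.37 | 91.73 | 0.49 |
| wbi | 92.23 | 91.63 | 1.18 | 92.24 | 91.51 | 1.13 |
| wcb | 91.90 | 91.22 | 1.23 | 91.83 | 91.05 | 1.26 |
| iwc | 92.35 | 91.75 | 1.30 | 92.34 | 91.60 | 1.31 |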

7.3.3. Results of LMA using Model Reconstruction

Table 8 shows the results of LMA using model reconstruction, where the number of principal components was empirically set as 6, 3 and 3 for cbi, cti and wbi, respectively. Because we do not have a shared n-gram list to construct the orthogonal bases for wcb, this method does not work for wcb. In the iwc model, only the wbi is reconstructed, and the wcb uses the model of minimum perplexity. Compared

to the baseline results in Table 4, we can see evident improvement of recognition accuracy by model recon-

struction in HCTR. For the cti model, the CR is improved from 90.81% to 92.06%, and the AR is improved

from 90.21% to 91.49%. However, the performance of model reconstruction is inferior to that of the model

combination methods in Table 6, though it is slightly better than the performance of model selection in

Table 5.

Table 8: Results of LMA using model reconstruction on CASIA-HWDB test set.

| LMs | CR(%) | AR(%) | Time(h) |
|-----|-------|-------|---------|
| cbi | 91.39 | 90.71 | 0.23 |
| cti | 92.06 | 91.49 | 0.44 |
| wbi | 92.10 | 91.49 | 1.17 |
| iwc | 92.14 | 91.55 | 1.32 |

7.3.4. Performance on the HIT-MW Test Set

Finally, we show the recognition results of our LMA methods on the HIT-MW test set, where each

document (a text line image) contains only 22 characters on average. Following the results on the larger CASIA-HWDB test set, the model combination method uses the three selected LMs of lowest perplexity, with the weights estimated by the MSE with L2-norm regularization (character-level models) or the heuristic method (word-level models). All experimental settings are the same as those on CASIA-HWDB, and the recognition results are shown in Table 9. We can see that all the LMA methods improve the recognition accuracy, and again the model combination performs best (CR is about 1.1 percent higher for cti), demonstrating the benefits of LMA even on short texts.

To verify the reliability of performance improvement, we give the results of statistical tests for compar-

isons of pairs of methods. The Wilcoxon signed-ranks test is claimed to be usually more powerful than the

t-test [47]. It ranks the differences in performances of two methods on each document, ignoring the signs,

and compares the ranks for the positive and negative differences. Here, we have four methods with 383

evaluations (CR values of the 383 documents in HIT-MW test set), and the results are shown in the ’s-test’

columns in Table 9 by the signs (’+’: significant and ’−’: non-significant at 0.05 significance level). Each

’s-test’ column shows the comparison results of that method with the methods in its left columns in turn.

We can see that all improvements by the LMA methods compared to the baseline system are significant (the first sign in each column is '+'); however, the differences among the three LMA methods are mostly not significant (most second and third signs are '−'), due to the insufficient estimation from these very short texts in model combination and reconstruction.

Table 9: Results of LMA on the HIT-MW test set.

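| LMs | Baseline: CR(%) | Baseline: AR(%) | Model selection: CR(%) | Model selection: AR(%) | Model selection: s-test | Model combination: CR(%) | Model combination: AR(%) | Model combination: s-test | Model reconstruction: CR(%) | Model reconstruction: AR(%) | Model reconstruction: s-test |
|-----|------|------|------|------|------|------|------|------|------|------|------|
| cbi | 91.70 | 90.79 | 92.25 | 91.30 | + | 92.52 | 91.41 | + − | 92.28 | 91.39 | + − + |
| cti | 92.28 | 91.52 | 93.03 | 92.31 | + | 93.44 | 92.63 | + + | 93.22 | 92.47 | + − − |
| wbi | 92.46 | 91.50 | 93.03 | 92.10 | + | 93.37 | 92.48 | + + | 93.21 | 92.26 | + − − |
| wcb | 92.01 | 91.02 | 92.65 | 91.63 | + | 92.74 | 91.62 | + − | N/A | N/A | N/A |
| iwc | 92.52 | 91.57 | 93.13 | 92.23 | + | 93.32 | 92.41 | + − | 93.12 | 92.22 | + − − |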

7.4. Effects of Language Model Compression

The above experiments used the LMs without compression, which consume a large amount of storage, especially for cti. Recall that in Section 6 we introduced the LM compression methods, including SVQ of each LM separately, PCA reduction of the whole LM set, and the combination of them (SVQ+PCA). In this section, we evaluate these compression methods on the CASIA-HWDB test set, and the results are shown in Table 10. Because PCA compression does not work for wcb, there are no results (denoted as 'N/A') of wcb for 'PCA' and 'SVQ+PCA', and in iwc, only the wbi is compressed by PCA. The LMA method uses the model combination with three selected LMs of lowest perplexities (weights estimated by MSE with L2-norm regularization). In the table, 'Size' denotes the storage size (MB) for the whole LM set (i.e., there are 16 n-gram models for character-level, while only 15 n-gram models for word-level without the ancient domain LM). The size of iwc equals the sum of those of wbi and wcb.

From Table 10, we can see that language model compression by either SVQ or PCA yields little loss of

recognition accuracy, while the storage size is reduced significantly, and the combination of SVQ and PCA

reduces the size further. For recognition time, the decompression of SVQ takes very little time, though PCA

needs a little time on searching an n-gram from a larger shared list. This demonstrates the effectiveness of

the proposed LM compression methods. Comparing the results of all LM types, the compression of cbi is

the most efficient with the ratio of about 83.4 percent from 134MB to 22.3MB. This is because both SVQ

and PCA only process the probability values of LM, and this part holds the highest ratio in cbi due to its

smallest size of n-gram table (character bi-gram) in the five LM types.

In addition, we represent the n-gram table by the trie-structure format as in our previous work [24],

which removes the repeated prefixes of n-grams and is hence lossless for recognition accuracy. This method further compresses the storage size of each LM and speeds up the search process a little. The

results are shown in the last column (denoted as ’+Trie’) of Table 10, and we can see that the storage

sizes of whole LM set are finally compressed to 14.4MB, 58.7MB, 68.7MB, 28.7MB and 97.4MB for cbi,

cti, wbi, wcb and iwc, respectively, and the average size of each cbi is compressed to 0.9MB for about 7

thousand uni-grams and 0.7 million bi-grams.

Table 10: The effects (the accuracy CR/AR (%) and Size (MB)/Time (h) given by LMA using model combination with three selected LMs) of language model compression on the CASIA-HWDB test set.

| LMs | Original: CR/AR | Original: Size/Time | SVQ: CR/AR | SVQ: Size/Time | PCA: CR/AR | PCA: Size/Time | SVQ+PCA: CR/AR | SVQ+PCA: Size/Time | +Trie: Size/Time |
|-----|------|------|------|------|------|------|------|------|------|
| cbi | 91.69/90.85 | 134/0.26 | 91.68/90.77 | 50.7/0.26 | 91.62/90.77 | 77.2/0.34 | 91.47/90.56 | 22.3/0.34 | 14.4/0.26 |
| cti | 92.37/91.73 | 470/0.49 | 92.35/91.70 | 178/0.49 | 92.18/91.55 | 291/0.76 | 92.12/91.48 | 102/0.74 | 58.7/0.50 |
| wbi | 92.24/91.51 | 314/1.13 | 92.20/91.47 | 147/1.13 | 92.18/91.46 | 227/1.24 | 92.10/91.34 | 102/1.20 | 68.7/1.05 |
| wcb | 91.83/91.05 | 81.6/1.26 | 91.74/90.87 | 38.7/1.31 | N/A | N/A | N/A | N/A | 28.7/1.36 |
| iwc | 92.34/91.60 | 396/1.31 | 92.30/91.51 | 186/1.30 | 92.32/91.58 | 309/1.36 | 92.24/91.44 | 141/1.41 | 97.4/1.41 |

7.5. Discussions

In this section, we discuss our LMA methods in four aspects. First, we show the effects of LMA on different domains; second, we evaluate the proposed LMA methods on documents with different recognition accuracies by the baseline system; third, we compare various LMA methods, including the model combination method in speech recognition; last, we give some recognition examples.

7.5.1. Effects of LMA on Different Domains

The above experiments show that all the LMA methods improve the recognition performance remarkably, justifying the importance of unsupervised LMA in HCTR. Here, we further investigate the effects of LMA on the documents of different domains in the CASIA-HWDB test set⁷. The results of cti and iwc (the other LM types give similar effects) are shown in Fig. 7, where the LMA uses the model combination method with three selected LMs of lowest perplexities, and the combination weights are estimated by MSE with L2-norm regularization. We can see that the performance improvement on the ancient domain (indexed as 15 in Table 3) documents is the largest, because the language style of ancient domain texts is very different from that of the general domain. Table 11 shows the results of LMA for ancient domain documents using character-level LMs (no word-level LMs of ancient domain are available), and the cti model improves the recognition accuracy by nearly 7 percent. The statistical test results are also shown in Table 11, which verify the significance (sign '+' in the 's-test' column) of the performance improvement by the LMA method.

Figure 7: Effects of LMA on each domain indexed following Table 3 ("-LMA" denotes the results using the LMA method; otherwise, they are baseline results). (a) cti, (b) iwc (for ancient domain, the second pass uses adaptive cti).

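Table 11: Effects of LMA for ancient domain documents.

| LMs | Baseline: CR(%) | Baseline: AR(%) | LMA: CR(%) | LMA: AR(%) | Improvement: CR(%) | Improvement: AR(%) | s-test |
|-----|------|------|------|------|------|------|------|
| cbi | 82.26 | 81.52 | 88.33 | 87.66 | 6.07 | 6.14 | + |
| cti | 82.13 | 81.43 | 89.06 | 88.46 | 6.93 | 7.03 | + |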

7.5.2. Effects of LMA on Different Baseline Accuracies

In this part, we analyse the recognition results of each document in the CASIA-HWDB test set, to investigate the effects of LMA on different baseline recognition accuracies. Figure 8 shows the recognition results of the cti model, where the LMA uses model combination of the three lowest-perplexity LMs with the weights estimated by MSE with L2-norm regularization. We can see that though most documents gained improved accuracy by LMA, there is no consistent correlation between the baseline accuracy and the effectiveness of LMA. For a document of low baseline accuracy, if most erroneous n-grams in the first-pass recognized text (i.e., baseline result) are meaningless for all the LMs and thus do not affect the selection of the best LM, it gains improved accuracy (e.g., the document of index 1 in Fig. 8). On the other hand, if these erroneous n-grams happen to belong to one domain of the pre-defined LM set, they may result in a wrong selection and deteriorate the recognition performance (e.g., the document of index 15 in Fig. 8).

⁷ In the CASIA-HWDB test set, we gave a domain label to each document, but used this information in evaluation only. There are 83 documents labeled as ancient domain, while there are no documents in the four domains of sport, house, Olympic and women.

Figure 8: Effect of LMA on documents with different baseline recognition accuracies in CASIA-HWDB test set. The indices of 1 to 39 represent the 39 documents of the minimum baseline CR (lower than 70 percent), and index 40 represents those documents with baseline CR between 70 and 75 percent (the accuracy is the average CR value of those documents), and index 41 represents those documents with baseline CR between 75 and 80 percent, and so on.

7.5.3. Comparison of Various LMA methods

This paper introduces three LMA methods, namely model selection, model combination and model

reconstruction, which are abbreviated as "MS", "MC" and "MR", respectively. In model combination, we

propose three weight estimation methods, called the heuristic method (MC-H), MSE with L2-norm regularization (MC-L2)

and MSE with L1-norm regularization (MC-L1). To compare these methods more clearly, in Table 12 we

summarize the results from Table 4 to Table 8 (cti model, three selected LMs for MC-H and MC-L2), give

the perplexity (PP) of each resultant model on the ground-truth texts of the CASIA-HWDB test set, and show

the results of statistical tests ('s-test', the same method as for Table 9) for comparisons of pairs of methods. We

also compare with the model combination method with weights learned by maximum likelihood (MC-ML)

estimation [17] and the LMA method in [25], with their results shown in Table 12.

The LMA method in [25] uses the ratio of string probability by each LM as the corresponding

weight. We can see that its results are similar to those of MS; this is because the string probability

Table 12: Summarized results of LMA using cti model on CASIA-HWDB test set.

           Base     MS       MC-H     MC-L2    MC-L1    MR       MC-ML    Ref [25]
CR(%)      90.81    91.92    92.27    92.37    92.32    92.06    92.28    91.98
AR(%)      90.21    91.37    91.71    91.73    91.64    91.49    91.71    91.43
Time(h)    0.14     0.26     0.51     0.49     0.80     0.44     0.53     0.51
PP         252      109      95       48       42       118      93       110
s-test     N/A      +        ++       +++      ++−+     +++++    ++−+−+   +−+++−+

ratio of the best LM is usually close to one when the string is longer than ten characters, while the other

weights approach zero. The MC-ML method is commonly used in speech recognition⁸ [17]. Like the

MC-H method, the MC-ML estimation is also based on the perplexity, hence their results (both recognition

accuracy and perplexity) are similar.
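A small numerical sketch may help to see why weights defined as string probability ratios collapse to model selection. It assumes, purely for illustration, that each LM assigns every character its average probability, so that the log-probability of an n-character string is -n log(PP); the three perplexity values and the function names are hypothetical.

```python
import math


def ratio_weights(log_probs):
    """Turn string log-probabilities into weights P_k(s) / sum_j P_j(s)."""
    m = max(log_probs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - m) for lp in log_probs]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical per-character perplexities of three LMs on the same string.
perplexities = [95.0, 150.0, 250.0]


def weights_for_length(n_chars):
    """Weights for a string of n_chars characters under the average-probability assumption."""
    log_probs = [-n_chars * math.log(pp) for pp in perplexities]
    return ratio_weights(log_probs)


for n in (1, 5, 10, 20):
    print(n, [round(w, 4) for w in weights_for_length(n)])
```

Because the probability ratio between two LMs is raised to the power of the string length, even a moderate per-character advantage pushes the weight of the best LM toward one as the string grows.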

From Table 12, we can see that our proposed LMA methods give better performance than the baseline

system in both recognition accuracy and perplexity. LMA gives a PP reduction of 53-83 percent, which

implies that the adapted LM matches the test document much better. Among the LMA methods, the MC

method reduces PP much more than the methods of MS and MR. This is because MC improves the general-

ization of each component model, and the performance of MR is limited to the shared n-gram list. Further

comparing all the MC methods, the perplexity of MC-H is much worse than those of MC-L2 and MC-L1.

The benefit of MC-L2 and MC-L1 is attributed to the fact that the objective of MSE training makes each

resultant probability closer to one, and the solution is globally optimal.
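The following is a minimal sketch of MSE weight estimation with L2-norm regularization, under stated assumptions: the target probability is one for every n-gram of the first-pass text, the weights are left unconstrained, and the probabilities in `probs` are made-up values. The function names and the regularization constant are hypothetical; the exact objective and any constraints used in the paper are defined in its earlier sections and are not reproduced here. The objective is sum_i (1 - sum_k w_k p_k(i))^2 + lam * ||w||^2, which is strictly convex for lam > 0, so the linear system below yields its unique global minimum.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def mse_l2_weights(probs, lam):
    """probs[i][k] is the probability of the i-th n-gram under LM k; targets are all 1.

    Setting the gradient to zero gives (P^T P + lam I) w = P^T 1.
    """
    k = len(probs[0])
    gram = [[sum(row[a] * row[b] for row in probs) + (lam if a == b else 0.0)
             for b in range(k)] for a in range(k)]
    rhs = [sum(row[a] for row in probs) for a in range(k)]
    return solve_linear(gram, rhs)


def objective(probs, weights, lam):
    """Sum of squared errors to the target 1 plus the L2 penalty."""
    err = sum((1.0 - sum(w * p for w, p in zip(weights, row))) ** 2 for row in probs)
    return err + lam * sum(w * w for w in weights)


# Made-up n-gram probabilities under three selected LMs (assumption).
probs = [
    [0.30, 0.05, 0.10],
    [0.20, 0.10, 0.05],
    [0.25, 0.02, 0.15],
    [0.10, 0.20, 0.05],
]
lam = 0.1
weights = mse_l2_weights(probs, lam)
print(weights, objective(probs, weights, lam))
```

A larger `lam` shrinks the weight vector toward zero; an L1-norm penalty would instead require an iterative solver, which is consistent with the longer running time reported for MC-L1 in Table 12.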

The comparison of recognition accuracies leads to a similar conclusion as that of perplexities, though

the improvement of accuracy is not as large as the reduction of perplexity. To verify the reliability of

performance improvement, each sign sequence in the last row of Table 12 gives the statistical test results

for comparisons of that method with the methods in the columns to its left, in turn ('+': significant and '−': non-

significant at the 0.05 significance level). We can see that all improvements by the LMA methods compared

to the baseline system are significant, and the MC-L2 method significantly outperforms all other methods,

including MC-ML (its fourth sign is '+' and its correct rate (CR) is lower than that of MC-L2).
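As a reading aid, the sign sequences can be decoded into explicit pairwise comparisons. The sketch below assumes that the last row of Table 12 splits in column order into sequences of 1, 2, ..., 7 signs (our reading of the table layout), and it writes the non-significant sign as an ASCII "-"; the function and variable names are hypothetical.

```python
methods = ["Base", "MS", "MC-H", "MC-L2", "MC-L1", "MR", "MC-ML", "Ref [25]"]

# Sign sequences transcribed from the last row of Table 12 (assumed split).
sign_rows = {
    "MS": "+",
    "MC-H": "++",
    "MC-L2": "+++",
    "MC-L1": "++-+",
    "MR": "+++++",
    "MC-ML": "++-+-+",
    "Ref [25]": "+-+++-+",
}


def decode(methods, sign_rows):
    """Map (method, left-column method) to True if their difference is significant."""
    result = {}
    for idx, method in enumerate(methods[1:], start=1):
        signs = sign_rows[method]
        if len(signs) != idx:
            raise ValueError(f"{method}: expected {idx} signs, got {len(signs)}")
        for other, sign in zip(methods[:idx], signs):
            result[(method, other)] = (sign == "+")
    return result


pairs = decode(methods, sign_rows)
print(pairs[("MC-ML", "MC-L2")])   # fourth sign of MC-ML
print(pairs[("Ref [25]", "MS")])   # second sign of Ref [25]
```

Under this reading, the fourth sign of every method to the right of MC-L2 is '+', in line with the statement that MC-L2 differs significantly from all other methods.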

8 It is difficult to use MAP estimation here because no prior information is available for the dynamically selected LMs.

7.5.4. Recognition Examples

According to the error analysis in our previous paper [4], this paper focuses on correcting the path search

failure by giving an adaptive, more accurate language model in the path evaluation function. Figure 9 shows

three examples of strings, where the first two samples (Fig.9(a) and 9(b)) were also given in our previous

paper [4] as error examples. Among these three examples, we can see that the second one (Fig.9(b)) is corrected

by the proposed LMA method (cti model with model combination adaptation), demonstrating its benefits in

HCTR, while the first one (Fig.9(a)) fails because the true character class is outside the candidate character

class set; in the third example (Fig.9(c)), the candidate set includes the true character class, but it is finally

recognized wrongly due to the limitation of the n-gram model, which only considers short-distance context.

Figure 9: Three recognition examples. (a) The error is not corrected due to the imperfect candidate character class set; (b) the error is corrected by LMA; (c) the error is not corrected due to the short-distance context limitation of the n-gram model (it ignores the semantic correlation of the two characters indicated by the arrows; both mean 'let', and the green one is recognized correctly). In each example, line 1 is the character string image, line 2 is the result of the baseline system, line 3 is the result of LMA, and line 4 is the ground truth.

8. Conclusion

This paper presents an unsupervised language model adaptation (LMA) framework for handwritten

Chinese text recognition. Based on a two-pass recognition strategy, we propose three methods to dynamically

generate an adaptive language model (LM) to match the test document via a pre-defined multi-domain LM

set, namely, model selection, model combination and model reconstruction. The experimental results show

that the model combination of three selected LMs performs best, considering the tradeoff between the

recognition accuracy and speed, by learning the combination weights via minimizing the sum of squared

errors (MSE) with L2-norm regularization. The results on both the CASIA-HWDB test set and the HIT-MW test

set (very short texts) justify the benefits of the proposed unsupervised LMA method, especially for ancient

domain documents with the recognition accuracy improved by 7 percent. Aiming for practical applications,

we substantially compress all LMs with little loss of accuracy using split vector quantization (SVQ) and principal

component analysis (PCA). The proposed methods are general enough to be applicable to

recognizing documents in other languages (such as English and Arabic) as well.

The analysis of recognition errors indicates that further research efforts are needed to improve character

classification and language modeling. Better character classifiers can improve the tradeoff between the

number of candidate classes and the probability of including the true class. The path evaluation criterion

can be improved by using better language models that fuse as much as possible of the syntactic, semantic,

and pragmatic characteristics for the recognition task, i.e., considering long-distance context.

References

[1] R.-W. Dai, C.-L. Liu, B.-H. Xiao, Chinese Character Recognition: History, Status and Prospects, Frontiers of Computer Science

in China 1 (2) (2007) 126-136.

[2] H. Fujisawa, Forty Years of Research in Character and Document Recognition—An Industrial Perspective, Pattern Recognition

41 (8) (2008) 2435-2446.

[3] T.-H. Su, T.-W. Zhang, D.-J. Guan, H.-J. Huang, Off-Line Recognition of Realistic Chinese Handwriting Using Segmentation-

Free Strategy, Pattern Recognition 42 (1) (2009) 167-182.

[4] Q.-F. Wang, F. Yin, C.-L. Liu, Handwritten Chinese Text Recognition by Integrating Multiple Contexts, IEEE Trans. Pattern Anal.

Mach. Intell. 34 (8) (2012) 1469-1481.

[5] J.R. Bellegarda, Statistical Language Model Adaptation: Review and Perspectives, Speech Communication 42 (1) (2004) 93-108.

[6] A. Sethy, P.G. Georgiou, B. Ramabhadran, S. Narayanan, An Iterative Relative Entropy Minimization-Based Data Selection

Approach for n-Gram Model Adaptation, IEEE Trans. Audio, Speech, Language Processing 17 (1) (2009) 13-23.

[7] A. Stolcke, SRILM—an extensible language modeling toolkit, Proc. 7th ICSLP, 2002, pp. 901-904.

[8] J.R. Bellegarda, Exploiting Latent Semantic Information in Statistical Language Modeling, Proceedings of the IEEE 88 (8) (2000)

1279-1296.

[9] D. Mrva, P.C. Woodland, Unsupervised Language Model Adaptation for Mandarin Broadcast Conversation Transcription, Proc.

7th Interspeech, 2006, pp. 2206-2209.

[10] Y.-C. Tam, T. Schultz, Correlated Latent Semantic Model for Unsupervised LM Adaptation, Proc. 32nd ICASSP, 2007, pp. 41-44.

[11] M. Bacchiani, B. Roark, Unsupervised Language Model Adaptation, Proc. 28th ICASSP, 2003, pp. 224-227.

[12] G. Tur, A. Stolcke, Unsupervised Language Model Adaptation for Meeting Recognition, Proc. 32nd ICASSP, 2007, pp. 173-176.

[13] P. Xiu, H.S. Baird, Whole-Book Recognition, IEEE Trans. Pattern Anal. Mach. Intell. 34 (12) (2012) 2467-2480.

[14] D.-S. Lee, R. Smith, Improving Book OCR by Adaptive Language and Image Models, Proc. 10th Int. Workshop on Document

Analysis Systems, 2012, pp. 115-119.

[15] P.C. Woodland, T. Hain, G.L. Moore, T.R. Niesler, D. Povey, A. Tuerk, E.W.D. Whittaker, The 1998 HTK Broadcast News

Transcription System: Development and Results, Proc. DARPA Broadcast News Workshop, 1999, pp. 265-270.

[16] C. Allauzen, M. Riley, Bayesian Language Model Interpolation for Mobile Speech Input, Proc. 12th Interspeech, 2011, pp.

1429-1432.

[17] X. Liu, M.J.F. Gales, P.C. Woodland, Use of Contexts in Language Model Interpolation and Adaptation, Computer Speech and

Language 27 (2013) 301-321.

[18] A. Stolcke, Entropy-Based Pruning of Backoff Language Models, Proc. DARPA Broadcast News Workshop, 1998, pp. 270-274.

[19] E.W.D. Whittaker, B. Raj, Quantization-based Language Model Compression, Proc. 7th Eurospeech, 2001, pp. 33-36.

[20] C.-L. Liu, F. Yin, D.-H. Wang, Q.-F. Wang, CASIA Online and Offline Chinese Handwriting Databases, Proc. 11th ICDAR,

2011, pp. 37-41.

[21] T.-H. Su, T.-W. Zhang, D.-J. Guan, Corpus-Based HIT-MW Database for Offline Recognition of General-Purpose Chinese

Handwritten Text, Int. J. Document Analysis and Recognition 10 (1) (2007) 27-38.

[22] L. Sirovich, M. Kirby, Low-dimensional procedure for the characterization of human faces, J. Optical Society of America A 4

(3) (1987) 519-524.

[23] T.F. Cootes, C.J. Taylor, D.H. Cooper, J. Graham, Active Shape Models-Their Training and Application, Computer Vision and

Image Understanding 61 (1) (1995) 38-59.

[24] Q.-F. Wang, F. Yin, C.-L. Liu, Improving Handwritten Chinese Text Recognition by Unsupervised Language Model Adaptation,

Proc. 10th Int. Workshop on Document Analysis Systems, 2012, pp. 110-114.

[25] Q. He, S. Chen, M. Zhao, W. Lin, A Hybrid Language Model for Handwritten Chinese Sentence Recognition, Proc. 13th ICFHR,

2012, pp. 129-134.

[26] J.F. Gao, H. Suzuki, W. Yuan, An Empirical Study on Language Model Adaptation, ACM Trans. Asian Language Information

Processing 5 (3) (2005) 209-227.

[27] F.-F. Liu, Y. Liu, Unsupervised Language Model Adaptation Incorporating Named Entity Information, Proc. 45th ACL, 2007,

pp. 672-679.

[28] D. Huggins-Daines, A.I. Rudnicky, Implicitly Supervised Language Model Adaptation for Meeting Transcription, Proc. NAACL-HLT,

2007, pp. 73-76.

[29] L. Chen, J.-L. Gauvain, L. Lamel, G. Adda, Dynamic Language Modeling for Broadcast News, Proc. 8th ICSLP, 2004, pp.

1281-1284.

[30] W. Wang, A. Stolcke, Integrating MAP, Marginals, and Unsupervised Language Model Adaptation, Proc. 8th Interspeech, 2007,

pp. 618-621.

[31] S.J. Pan, Q. Yang, A Survey on Transfer Learning, IEEE Trans. Knowledge and Data Engineering 22 (10) (2010) 1345-1359.

[32] J.F. Gao, M. Zhang, Improving Language Model Size Reduction using Better Pruning Criteria, Proc. 40th ACL, 2002, pp.

176-182.

[33] B.-J. Hsu, J. Glass, Iterative Language Model Estimation: Efficient Data Structure & Algorithms, Proc. 9th Interspeech, 2008,

pp. 1-4.

[34] A. Pauls, D. Klein, Faster and smaller n-gram language models, Proc. 49th ACL-HLT, 2011, pp. 258-267.

[35] W. Law, C.F. Chen, Split-dimension vector quantization of Parcor coefficients for low bit rate speech coding, IEEE Trans. Speech

Audio Process 2 (3) (1994) 443-446.

[36] T. Long, L.W. Jin, Building compact MQDF classifier for large character set recognition by subspace distribution sharing, Pattern

Recognition 41 (2008) 2916-2925.

[37] R. Rosenfeld, Two Decades of Statistical Language Modeling: Where Do We Go from Here? Proceedings of the IEEE 88 (8)

(2000) 1270-1278.

[38] S.F. Chen, J. Goodman, An Empirical Study of Smoothing Techniques for Language Modeling, Computer Speech and Language

13 (1999) 359-394.

[39] A. Vinciarelli, S. Bengio, H. Bunke, Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical

Language Models, IEEE Trans. Pattern Anal. Mach. Intell. 26 (6) (2004) 709-720.

[40] Y.X. Li, C.L. Tan, X.Q. Ding, A Hybrid Post-Processing System for Offline Handwritten Chinese Script Recognition, Pattern

Analysis and Applications 8 (3) (2005) 272-286.

[41] D. Jurafsky, J.H. Martin, Speech and Language Processing, 2nd ed. Pearson Prentice Hall, 2008.

[42] Q.-F. Wang, F. Yin, C.-L. Liu, Integrating Language Model in Handwritten Chinese Text Recognition, Proc. 10th ICDAR, 2009,

pp. 1036-1040.

[43] J. Friedman, T. Hastie, H. Hofling, R. Tibshirani, Pathwise Coordinate Optimization, The Annals of Applied Statistics 1 (2)

(2007) 302-332.

[44] F. Kimura, K. Takashina, S. Tsuruoka, Y. Miyake, Modified Quadratic Discriminant Functions and The Application to Chinese

Character Recognition, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1) (1987) 149-153.

[45] C.-L. Liu, Normalization-Cooperated Gradient Feature Extraction for Handwritten Character Recognition, IEEE Trans. Pattern

Anal. Mach. Intell. 29 (8) (2007) 1465-1469.

[46] S. Martin, J. Liermann, H. Ney, Algorithms for Bigram and Trigram Word Clustering, Speech Communication 24 (1998) 19-37.

[47] J. Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research 7 (2006) 1-30.
