unsupervised language model adaptation for handwritten chinese text recognition
Author's Accepted Manuscript
Unsupervised language model adaptation for handwritten Chinese text recognition
Qiu-Feng Wang, Fei Yin, Cheng-Lin Liu
PII: S0031-3203(13)00387-7
DOI: http://dx.doi.org/10.1016/j.patcog.2013.09.015
Reference: PR4919
To appear in: Pattern Recognition
Received date: 13 December 2012
Revised date: 17 September 2013
Accepted date: 19 September 2013
Cite this article as: Qiu-Feng Wang, Fei Yin, Cheng-Lin Liu, Unsupervised language model adaptation for handwritten Chinese text recognition, Pattern Recognition, http://dx.doi.org/10.1016/j.patcog.2013.09.015
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
www.elsevier.com/locate/pr
Unsupervised Language Model Adaptation for Handwritten Chinese Text Recognition
Qiu-Feng Wang, Fei Yin, Cheng-Lin Liu
National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences
No.95 Zhongguancun East Road, Beijing 100190, P.R. China
Email: {wangqf, fyin, liucl}@nlpr.ia.ac.cn
Abstract
This paper presents an effective approach for unsupervised language model adaptation (LMA) using multiple
models in offline recognition of unconstrained handwritten Chinese texts. The domain of the document
to be recognized is variable and usually unknown a priori, so we use a two-pass recognition strategy with a
pre-defined multi-domain language model set. We propose three methods to dynamically generate an adaptive
language model that matches the text output by first-pass recognition: model selection, model combination
and model reconstruction. In model selection, we use the language model with minimum perplexity on
the first-pass recognized text. In model combination, we learn the combination weights by minimizing
the sum of squared error with L2-norm or L1-norm regularization. In model reconstruction, we
use a group of orthogonal bases to reconstruct a language model whose coefficients are learned to match
the document to be recognized. Moreover, we reduce the storage size of multiple language models using two
compression methods: split vector quantization (SVQ) and principal component analysis (PCA).
Comprehensive experiments on two public Chinese handwriting databases, CASIA-HWDB and HIT-MW, show
that the proposed unsupervised LMA approach improves recognition performance substantially, particularly
for ancient-domain documents, for which the recognition accuracy improves by 7 percent. Meanwhile, the
combination of the two compression methods largely reduces the storage size of language models with little
loss of recognition accuracy.
Keywords: Character string recognition, Chinese handwriting recognition, unsupervised language model
adaptation, language model compression.
Preprint submitted to Pattern Recognition September 26, 2013
1. Introduction
Handwritten Chinese character recognition has attracted a lot of attention since the 1970s and has
achieved tremendous advances [1, 2]. However, the recognition of unconstrained handwritten Chinese
texts has been reported only in recent years, and the reported accuracies are quite low (e.g., a character-level
correct rate of 39 percent in [3]). Besides the diversity of writing styles, handwritten text recognition
is difficult due to the weak lexical constraint: the number of sentence classes is infinite. Although our
recent work integrating multiple contexts, including linguistic context, achieved a high correct rate of 91
percent [4], many recognition errors remain due to the insufficient modeling of linguistic context; in
particular, the language model often mismatches the domain of the handwritten text. To deal with this mismatch
problem, we investigate language model adaptation for handwritten Chinese text recognition (HCTR) in
this paper, in particular unsupervised adaptation for the scenario in which there is no prior information about the
domain of the text.
A language model (LM) provides a principled way to quantify the uncertainties associated with natural
language, but this uncertainty varies across texts of different domains. The modeling of diverse
domains was originally studied in speech recognition [5], leading to research on language model adaptation.
Researchers often combine a generic LM with a domain-specific LM that is more relevant to the
recognition task. However, it is usually difficult to get enough domain-specific data to learn the
domain-specific model [5], and there is growing interest in collecting texts from the Internet to supplement
sparse domain-specific resources [6]. For Chinese, the Sogou Labs (footnote 1) provide a large set of
resources relevant to diverse domains extracted from the Internet, which can be used to train a multi-domain
LM set via the SRILM toolkit [7]. This LM set can be adaptively applied in HCTR according to the domain
of each document.
Language model adaptation (LMA) is difficult when the domain of the recognition task is unknown a priori,
a situation that can only be handled by unsupervised approaches. Many efforts have been made in this direction in
speech recognition. One method is to calculate the probability of a word given a document (herein, the
history of recognized text) as a uni-gram model by latent topic analysis [5, 8, 9, 10], and then interpolate it
with the generic LM. Another popular method is to use the recognized text directly to estimate an adaptive
n-gram model [11, 12], which is interpolated with the generic LM for the next-pass recognition. This
method is usually based on a multi-pass (e.g., two-pass) recognition framework. In handwriting recognition,
unsupervised LMA has been reported only in recent years, and the adaptive LM is very simple. For example,
Xiu and Baird [13] applied the multi-pass recognition strategy iteratively to adapt a word lexicon, and
Lee and Smith [14] further modified uni-gram probabilities iteratively for English whole-book recognition,
where the texts are very long.
Footnote 1: http://www.sogou.com/labs/resources.html
Another difficulty of LMA is that the transcripts of handwritten texts are usually short. There are
usually only a few hundred characters in a handwritten text (e.g., Fig. 1), which impedes the direct adaptation
of the lexicon or n-gram probabilities. In such a situation, the interpolation of models from a pre-defined
multi-domain LM set is usually used in speech recognition, and the interpolation weights are learned by
maximum a posteriori (MAP) estimation from held-out data similar to the task domain (supervised
LMA, e.g., [5, 15]), from the recognized texts (unsupervised LMA, e.g., [16]), or from both (e.g., [17]). This solution
also brings the problem of the large storage size of the LM set. To overcome this problem, the size of each
LM is usually reduced by entropy-based pruning [18] and quantization-based compression [19].
In this paper, we propose an unsupervised LMA framework for HCTR. Because no prior information about
the document to be recognized (test document) is available, we use a two-pass recognition strategy. Considering
that handwritten texts are varied and short, we dynamically generate an adaptive LM to match the
first-pass recognized text of each document using a pre-defined multi-domain LM set. We propose three
methods to generate adaptive LMs, namely model selection, model combination and model reconstruction.
The model selection method selects the best LM according to the minimum perplexity criterion. In
model combination, we estimate the combination weights by minimizing the sum of squared error (MSE)
with L2-norm or L1-norm regularization. In model reconstruction, the adaptive LM is constructed
from a group of orthogonal bases. To make the adaptation approach practical, we also consider the reduction of
computational cost and storage space. To speed up the two-pass recognition, we store the candidate character
classes of each candidate character pattern in the first-pass recognition to avoid repeated classification in
the second pass. To reduce the storage size of the LM set, we compress the LMs using split vector quantization
(SVQ) and principal component analysis (PCA). Finally, we evaluated the recognition performance on two
public Chinese handwriting databases, CASIA-HWDB [20] and HIT-MW [21], and showed a large improvement
in performance gained by the proposed unsupervised LMA methods, at a computational
cost comparable to that of the baseline system.
Figure 1: Two examples of handwritten Chinese text. (a) modern domain text; (b) ancient domain text.
Unlike previous works on unsupervised LMA in speech recognition (e.g., [17]), we regard language
model combination as a linear regression problem. We learn the combination weights by minimizing an error
cost function (i.e., MSE) and further consider model sparsity by adding L1-norm regularization, which is
fundamentally different from the MAP or MBR (minimum Bayes' risk) based framework [17]. The MAP estimation
is based on the perplexity, and MBR takes into account the acoustic model under a supervised framework;
both reach a locally optimal solution by the Baum-Welch algorithm, while the MSE-based method aims at
error loss minimization on the recognized text with a globally optimal solution. Another contribution of this
paper is the new LMA method based on model reconstruction by applying PCA to a pre-defined LM set.
This idea is motivated by a technique developed by Sirovich and Kirby [22] for efficiently representing
images of human faces using PCA, and it has also been successfully used in image processing, for example in active
shape models [23]. More importantly, the focus of this work is to investigate the role of LMA in Chinese
handwriting recognition, which, to the best of our knowledge, has not been investigated in depth in the handwriting
recognition field. A preliminary conference version of the paper was presented in [24]; this extended
version provides more detailed descriptions, presents additional LMA methods and a significantly extended
experimental validation. By customizing the handwriting recognition algorithm, the proposed approach can
also be applied to the recognition of documents in other languages (such as English and Arabic).
The rest of this paper is organized as follows: Section 2 reviews some related works; Section 3 gives an
overview of our HCTR system with unsupervised LMA; Section 4 briefly describes the statistical language
models used in this paper; Section 5 describes in detail the unsupervised LMA methods of model selection,
model combination and model reconstruction; Section 6 introduces the LM compression methods; Section 7
presents the experimental results, and Section 8 draws concluding remarks.
2. Related Works
The large variability of domains across different handwritten texts makes accurate language modeling
a challenge. Language model adaptation (LMA) is a process that adapts the language model to match
the domain of each recognition task. However, LMA has been rarely studied in handwriting recognition.
Recently, Xiu and Baird [13] adapted a word lexicon from the previously recognized text for English
whole-book recognition, and Lee and Smith [14] further modified word uni-gram probabilities in caches. In
Chinese handwriting recognition, we reported our first attempt at unsupervised LMA using a simple model
combination method [24]; a similar work was then presented in [25], where the estimation of combination
weights is very sensitive to the string length.
Although few works on LMA in handwriting recognition have been reported, many studies have been
conducted on LMA in speech recognition [5, 17] and natural language processing (NLP) [26, 27], which
can be categorized into supervised and unsupervised LMA. Supervised LMA assumes that the topic
information of the recognition task is known in advance and that a set of domain-specific held-out data
is available to either learn the interpolation weights of several pre-defined LMs (e.g., [5, 15]) or train a
domain-specific LM to interpolate with the generic LM (e.g., [26, 28]). In contrast, various unsupervised
methods utilize the recognized text to get an adaptive LM, and they can be grouped into three categories.
• Latent topic analysis. Such methods view the recognized text as a document and calculate the
probability of a word given the document using latent semantic analysis (LSA) [5, 8] or its
probabilistic extensions: probabilistic LSA (PLSA) and latent Dirichlet allocation (LDA) [9, 10]. This
probability is then treated as an adaptive uni-gram model to interpolate with the generic LM for further
recognition.
• Estimating an n-gram model. If the recognized text is long enough, it can be directly used to train
an adaptive n-gram model [11, 12]; otherwise, it can be used as queries to select relevant data from a
large general corpus via information retrieval (IR) techniques, to train an adaptive n-gram model [29].
Finally, this adaptive n-gram model is also interpolated with the generic LM for further recognition.
• Mixture models. Instead of training a domain-specific LM to interpolate with the generic LM, the
recognized text is used to learn the interpolation weights of mixture models, where each component
LM is related to one pre-defined domain [16, 17]. The interpolation weights are typically estimated
by the maximum a posteriori (MAP) method [17]. The Bayesian method has been used in the mobile
speech recognition system when the prior probability of each domain is known [16].
Both supervised and unsupervised LMA improve the recognition performance, and Wang and Stolcke [30]
combined the two methods, producing an additional gain. However, unsupervised LMA is more relevant
to real applications where the topic is unknown a priori.
In addition, LMA is closely related to transfer learning [31] in the field of machine learning, which tries
to give a solution to the mismatch between training data and test data in either the feature representation or
the sample distribution. The distribution mismatch is the concern of LMA methods, because the statistical
LM is actually a probability distribution on character or word sequences [5].
With text corpora increasingly available, the statistical LM directly trained from the corpora can contain
billions of n-grams. Many efforts have been made to reduce the storage of such models. On one hand,
several methods focus on how to remove the useless n-grams [18, 32]. On the other hand, more compact
representations of n-gram models without removing any n-grams are studied [19, 33, 34], such as trie
structure of n-gram table and data quantization of n-gram values (i.e., probabilities and back-off weights).
While these methods encode each n-gram separately, the quantization of groups of n-grams can further
compress the LM efficiently. The split vector quantization (SVQ) technique splits high-dimensional vectors
into low-dimensional ones and compresses them by vector quantization. It was originally developed in
speech recognition [35], and has been successfully used to compress a Gaussian classifier in handwriting
recognition [36]. This technique, however, has not been evaluated in compressing statistical language
models.
3. System Overview
The baseline handwriting recognition system without LMA is introduced in our previous paper [4],
which is based on the integrated segmentation-and-recognition framework with character over-segmentation.
Because no prior information about each handwritten text is available, we use a two-pass recognition strategy
for LMA. In the first-pass recognition, a generic LM is used to get a preliminary recognized text, then the
text is used to get an adaptive LM for the second-pass recognition.
Figure 2 illustrates the block diagram of the complete system with LMA, where seven steps are included
in the first-pass recognition:
(1) The document image is segmented into text lines;
(2) Each line image is over-segmented into a sequence of primitive segments (Fig. 3(a)) such that each
segment is a character or a part of a character;
(3) One or more consecutive segments are concatenated to generate candidate character patterns (Fig. 3(b));
(4) Each candidate pattern is classified to assign several candidate character classes, forming a character
candidate lattice (Fig. 3(c));
(5) Each sequence of candidate character classes is matched with a lexicon to segment it into candidate
words (footnote 2), forming a word candidate lattice (Fig. 3(d));
(6) Each character or word sequence $C$ paired with candidate pattern sequence $X^s$ (this pair is called a
candidate segmentation-recognition path) is evaluated by fusing multiple contexts, and the optimal
path is searched to give the segmentation and recognition result;
(7) The recognized results of all text lines are concatenated to give the document transcript (recognized
text), which is used for LMA in the second-pass recognition or output as the final result.
Figure 2: System diagram for handwritten Chinese text recognition with LMA.
In the above, only the last three steps are needed in the second-pass recognition, because the character
candidate lattice (all candidate character classes) of each text line is independent of the LM and can be
stored (using little memory) after the first-pass recognition to avoid repeated character classification. Note
that the classification of all candidate character patterns dominates the whole recognition time (see Table 4);
by skipping it in the second pass, the two-pass recognition system incurs little additional cost compared to the baseline system.
Footnote 2: In Chinese, a word may comprise one or multiple characters, and words can capture both syntactic and semantic meaning better than
single characters.
Figure 3: (a) Over-segmentation; (b) Segmentation candidate lattice; (c) Character candidate lattice of a segmentation (thick path) in (b); (d) Word candidate lattice of (c).
In this work, we evaluate each segmentation-recognition path using the function of weighting with
character pattern width (WCW), which integrates the character classification score, geometric context and
linguistic context [4]:
\[ f(X^s, C) = \sum_{i=1}^{m} \Big( w_i \cdot lp_i^0 + \sum_{j=1}^{4} \lambda_j \cdot lp_i^j \Big) + \lambda_5 \cdot \log P(C), \tag{1} \]
where $m$ is the number of characters in the path $X^s$, and $w_i$ is the width of the $i$-th character pattern after
normalizing by the estimated height of the text line. $lp_i^0 = \log p(c_i|x_i)$ is the character classification score
for candidate character class $c_i$ of the $i$-th character pattern, which is calculated by the confidence
transformation of the character classifier outputs. $lp_i^1 = \log p(c_i|g_i^{uc})$, $lp_i^2 = \log p(c_{i-1}c_i|g_i^{bc})$,
$lp_i^3 = \log p(z_i^p = 1|g_i^{ui})$, and $lp_i^4 = \log p(z_i^g = 1|g_i^{bi})$ are four geometric model scores of the
$i$-th character pattern, where $g_i^{uc}$, $g_i^{bc}$, $g_i^{ui}$ and $g_i^{bi}$ represent geometric features that are
unary class-dependent (e.g., character position), binary class-dependent (e.g., distance between two
consecutive characters), unary class-independent (e.g., size of the bounding box) and binary class-independent
(e.g., gap between two consecutive bounding boxes), respectively.
$z_i^p = 1$ and $z_i^g = 1$ mean a valid character pattern and a valid between-character gap, respectively. $\log P(C)$
denotes the linguistic context score of this path. The combining weights $\lambda_j$, $j = 1, \ldots, 5$, are optimized by
Maximum Character Accuracy (MCA) training. Under this path evaluation function, we use a refined beam
search method to efficiently find the optimal path. All the details of these techniques can be found in our
previous paper [4], and we focus on how to get the adaptive language model in this paper, i.e., the part
surrounded by the dotted line in Fig. 2.
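As a rough illustration of Eq. (1), the following minimal sketch scores one candidate path from precomputed log-scores. The function name `path_score` and the list-based inputs are hypothetical choices made for this example; the actual system [4] evaluates paths inside a beam search rather than one at a time.

```python
def path_score(widths, class_logp, geo_logp, lm_logp, lambdas):
    """Evaluate Eq. (1) for one segmentation-recognition path.

    widths[i]     : w_i, normalized width of the i-th character pattern
    class_logp[i] : lp_i^0, classifier log-score of the i-th character
    geo_logp[i]   : [lp_i^1, ..., lp_i^4], the four geometric log-scores
    lm_logp       : log P(C), linguistic context score of the whole path
    lambdas       : [lambda_1, ..., lambda_5], the combining weights
    """
    total = 0.0
    for w, lp0, geo in zip(widths, class_logp, geo_logp):
        total += w * lp0                                   # width-weighted classifier term
        total += sum(lam * lp for lam, lp in zip(lambdas[:4], geo))
    return total + lambdas[4] * lm_logp                    # language model term


# Toy path with two character patterns (all numbers are made up).
score = path_score(
    widths=[1.0, 0.5],
    class_logp=[-1.0, -2.0],
    geo_logp=[[-1.0, -1.0, -1.0, -1.0], [0.0, 0.0, 0.0, 0.0]],
    lm_logp=-3.0,
    lambdas=[0.1, 0.2, 0.3, 0.4, 2.0],
)
```

Only the last term depends on the language model, which is why swapping in an adaptive LM changes the ranking of paths without touching the classifier or geometric scores.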
4. Statistical Language Model
In the path evaluation function Eq. (1), the linguistic context score log P(C) plays a very important role,
which is usually given by a statistical language model. The most popular language model is the n-gram
model [37], where n is called the order of the model. Such model characterizes the statistical dependency
between n characters or words. In consideration of model complexity, the order n usually takes 2 or 3,
meaning the bi-gram and tri-gram model, respectively. In this paper, we evaluate five types of n-gram
models, which have been successfully used in our baseline system [4], namely, character bi-gram (cbi),
character tri-gram (cti), word bi-gram (wbi), word class bi-gram (wcb), and interpolating word and class
bi-gram (iwc). All these models are summarized in Table 1, where $C = <c_1 \ldots c_m>$ is a character sequence,
and $m$ is the number of characters in $C$. At the word level, $C$ is segmented into a word sequence $C = <w_1 \ldots w_l>$,
where $l$ is the number of words and $W_i$ is the word class of the word $w_i$. The weight $\lambda_6$ is learned together with the weights
in Eq. (1) by MCA training [4].
Table 1: The five types of n-gram models used in our system.
level      n-gram  formula
character  cbi     $P_{cbi}(C) = \prod_{i=1}^{m} p(c_i|c_{i-1})$
character  cti     $P_{cti}(C) = \prod_{i=1}^{m} p(c_i|c_{i-2}c_{i-1})$
word       wbi     $P_{wbi}(C) = \prod_{i=1}^{l} p(w_i|w_{i-1})$
word       wcb     $P_{wcb}(C) = \prod_{i=1}^{l} p(w_i|W_i)\,p(W_i|W_{i-1})$
word       iwc     $\log P_{iwc}(C) = \log P_{wbi}(C) + \lambda_6 \cdot \log P_{wcb}(C)$
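The word-level rows of Table 1 can be sketched in a few lines. This is a toy illustration under stated assumptions: probabilities are stored in plain dictionaries, each word belongs to exactly one class, and a sentence-start symbol `<s>` supplies the first history. None of these storage details are specified in the text.

```python
import math

BOS = "<s>"  # assumed sentence-start symbol (not specified in the text)


def log_p_wbi(words, p_bigram):
    """log P_wbi(C) = sum_i log p(w_i | w_{i-1})."""
    total, prev = 0.0, BOS
    for w in words:
        total += math.log(p_bigram[(prev, w)])
        prev = w
    return total


def log_p_wcb(words, word_class, p_word_in_class, p_class_bigram):
    """log P_wcb(C) = sum_i [log p(w_i | W_i) + log p(W_i | W_{i-1})]."""
    total, prev_class = 0.0, BOS
    for w in words:
        cls = word_class[w]
        total += math.log(p_word_in_class[w])              # p(w_i | W_i)
        total += math.log(p_class_bigram[(prev_class, cls)])  # p(W_i | W_{i-1})
        prev_class = cls
    return total


def log_p_iwc(words, p_bigram, word_class, p_word_in_class, p_class_bigram, lambda6):
    """log P_iwc(C) = log P_wbi(C) + lambda_6 * log P_wcb(C)."""
    return (log_p_wbi(words, p_bigram)
            + lambda6 * log_p_wcb(words, word_class, p_word_in_class, p_class_bigram))


# Toy model with two words and two classes (all numbers are made up).
words = ["a", "b"]
p_bigram = {(BOS, "a"): 0.5, ("a", "b"): 0.25}
word_class = {"a": "X", "b": "Y"}
p_word_in_class = {"a": 0.5, "b": 1.0}
p_class_bigram = {(BOS, "X"): 1.0, ("X", "Y"): 0.5}
iwc = log_p_iwc(words, p_bigram, word_class, p_word_in_class, p_class_bigram, lambda6=2.0)
```

Note that iwc is a log-linear combination, so it is a score rather than a normalized probability.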
The probabilities of an n-gram model are usually estimated from a large text corpus using maximum
likelihood estimation with smoothing techniques [38]. Smoothing is used to overcome the problem of
zero probabilities of the unseen n-grams in the training corpus. Many smoothing techniques have been
proposed in the literature [38], but none of them is systematically superior to the others in handwriting
recognition [39, 40]. We use the Katz back-off smoothing method [38, 41], which is the default method in
the SRILM [7] toolkit. Back-off means that, when a higher-order n-gram ($<\omega_1, \ldots, \omega_n>$) does not appear
in the corpus, the probability estimate falls back to a lower-order n-gram calculation. It is formulated by
\[ p(\omega_n|\omega_1^{n-1}) = \begin{cases} p^*(\omega_n|\omega_1^{n-1}), & \text{if } C(\omega_1^n) > 0, \\ \alpha(\omega_1^{n-1}) \cdot p(\omega_n|\omega_2^{n-1}), & \text{if } C(\omega_1^n) = 0, \end{cases} \tag{2} \]
where $\omega_1^{n-1} = <\omega_1, \ldots, \omega_{n-1}>$ denotes the $n-1$ history characters (words) of $\omega_n$, $C(\cdot)$ is the number of times the
argument is counted in the corpus, $p^*(\cdot)$ is the Good-Turing discounted probability [38, 41] and $\alpha(\cdot)$ is the
scaling factor that ensures the smoothed probabilities $p(\cdot)$ sum to one.
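The recursion in Eq. (2) can be sketched as a table lookup. The dictionary layout below is an assumption made for illustration, not the SRILM data structure: `p_star` maps every stored n-gram (as a tuple, any order) to its discounted probability, and `alpha` maps a history tuple to its scaling factor. Treating a missing history as having a scaling factor of 1 follows the usual convention for back-off models stored in ARPA format.

```python
def backoff_prob(ngram, p_star, alpha):
    """Return p(w_n | w_1 .. w_{n-1}) by the back-off rule of Eq. (2)."""
    if ngram in p_star:                      # C(w_1^n) > 0: use the discounted estimate
        return p_star[ngram]
    if len(ngram) == 1:                      # nothing left to back off to
        raise KeyError(f"unigram {ngram!r} missing from the model")
    history = ngram[:-1]
    # C(w_1^n) = 0: scale the lower-order estimate p(w_n | w_2 .. w_{n-1}).
    return alpha.get(history, 1.0) * backoff_prob(ngram[1:], p_star, alpha)


# Toy bi-gram model over the vocabulary {a, b} (all numbers are made up).
p_star = {("a",): 0.5, ("b",): 0.5, ("a", "b"): 0.8}
alpha = {("a",): 0.4}

p_seen = backoff_prob(("a", "b"), p_star, alpha)      # stored bi-gram
p_unseen = backoff_prob(("a", "a"), p_star, alpha)    # backs off: 0.4 * p(a)
```

In the toy model the two continuations of history `a` sum to one (0.8 + 0.2), which is exactly the role of the scaling factor.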
Obviously, an n-gram model has two categories of parameters besides the n-gram table: discounted
probabilities $p^*(\cdot)$ and scaling factors $\alpha(\cdot)$, and only the n-grams that appear in the corpus need to be stored. For example, a
bi-gram model needs to store only the probabilities of all observed n-grams (i.e., bi-grams and uni-grams) and
the scaling factors of all observed uni-grams. However, there are still too many n-gram parameters due to
the large number of Chinese characters (more than 7,000 in our system) and words (about 0.3 million). In
our system, we use the entropy-based pruning algorithm [18] to remove those n-grams whose removal raises the perplexity
by less than a threshold, which is set empirically as $5 \times 10^{-8}$, $10^{-7}$ and $10^{-7}$ for cbi, cti
and wbi, respectively [42]. Since the number of word classes (1,000 in our system) leads to a moderate model
size, the parameters of wcb are not pruned.
In addition, we evaluate the performance of a language model using the perplexity (PP) [41], which is
calculated as the reciprocal of the geometric average probability:
\[ PP(C) = P(C)^{-\frac{1}{m}} = \frac{1}{\sqrt[m]{\prod_{i=1}^{m} p(c_i|h_i)}}, \tag{3} \]
where $C$ is the measured text containing $m$ characters (words), $h_i$ denotes the history characters (words) of $c_i$, and
$p(c_i|h_i)$ is given by the evaluated language model (e.g., bi-gram: $p(c_i|h_i) = p(c_i|c_{i-1})$). The logarithm of the
perplexity equals the negative log-likelihood of the language model on the text $C$ divided by $m$, so a lower perplexity
indicates a better model.
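Eq. (3) is best computed in the log domain to avoid underflow on long texts. The sketch below assumes a hypothetical callable `cond_prob(history, token)` that returns $p(c_i|h_i)$; that interface is invented for this example and is not part of the described system.

```python
import math


def perplexity(tokens, cond_prob):
    """PP(C) = exp(-(1/m) * sum_i log p(c_i | h_i)), equivalent to Eq. (3)."""
    log_sum = 0.0
    for i, tok in enumerate(tokens):
        log_sum += math.log(cond_prob(tokens[:i], tok))
    return math.exp(-log_sum / len(tokens))


# A uniform model over four symbols has perplexity 4 on any text.
pp_uniform = perplexity(list("abcd"), lambda history, tok: 0.25)

# Toy bi-gram model: only the last history token matters ("<s>" is an assumed start symbol).
table = {("<s>", "a"): 0.5, ("a", "b"): 0.125}
pp_bigram = perplexity(
    ["a", "b"],
    lambda history, tok: table[(history[-1] if history else "<s>", tok)],
)
```

In the bi-gram example $P(C) = 0.5 \times 0.125 = 0.0625$ and $m = 2$, so the perplexity is $0.0625^{-1/2} = 4$.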
Each n-gram model above (e.g., cbi, cti) can be seen as a discrete probability distribution over all
n-grams, which can be represented as a vector whose dimensionality is the number of all n-grams. This
concept of vector representation will be adopted in the following sections.
5. Language Model Adaptation
This section presents three language model adaptation (LMA) methods, which are necessary when the
generic language model (LM) does not match the handwritten text well. Because the domain of the handwritten
text is variable and unknown a priori, we use a two-pass recognition strategy for unsupervised adaptation
of the language model, which is described as follows:
Two-Pass recognition for unsupervised LMA
(1) Use a generic language model (LM0) to recognize a document to obtain a preliminary transcript C;
(2) Generate an adaptive language model (LM∗) that best matches the preliminary transcript;
(3) Use LM∗ to recognize the document again to obtain the final transcript.
More passes of recognition can be considered, but our experiments show that this does not improve the
recognition performance further. In the above, Step 2 plays a key role in the whole process, and we will
describe how to generate the adaptive LM via three methods, namely, model selection, model combination
and model reconstruction.
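The two-pass procedure can be outlined as follows. This is only a structural sketch: `decode` and `make_adaptive_lm` are hypothetical callables standing in for the path search and for Step 2, and the "lattice" in the toy usage is just a list of candidate strings.

```python
def two_pass_recognize(line_lattices, decode, generic_lm, make_adaptive_lm):
    """Recognize one document twice, reusing the cached per-line lattices."""
    # Step 1: first pass with the generic language model LM0.
    first_pass = [decode(lattice, generic_lm) for lattice in line_lattices]
    transcript = "".join(first_pass)                 # preliminary transcript C
    # Step 2: build the adaptive model LM* from the preliminary transcript.
    adaptive_lm = make_adaptive_lm(transcript)
    # Step 3: second pass on the same lattices, so no character
    # classification has to be repeated.
    return "".join(decode(lattice, adaptive_lm) for lattice in line_lattices)


# Toy stand-ins: an "LM" is a dict of candidate scores, and decoding picks
# the highest-scoring candidate of each line.
def toy_decode(candidates, lm):
    return max(candidates, key=lambda c: lm.get(c, 0.0))


result = two_pass_recognize(
    line_lattices=[["cat", "cot"]],
    decode=toy_decode,
    generic_lm={"cat": 1.0, "cot": 2.0},
    make_adaptive_lm=lambda transcript: {"cat": 3.0, "cot": 0.0},
)
```

The three methods described next differ only in what `make_adaptive_lm` does with the preliminary transcript.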
5.1. Model Selection
In the two-pass recognition strategy, we can get the transcript of each document after the first-pass
recognition. Although this transcript is directly domain-specific data, the number of characters (words)
is too small (specifically, 200 - 300 characters in each document of our database, and words are fewer) and
it contains recognition errors. However, it does carry the domain information of the document, and we can
use this information to get a matched language model.
Meanwhile, we can prepare a language model set ({LM1, LM2, . . . , LMK}, where K is the number of
pre-defined domains), in which each language model (LMk) corresponds to one specific domain (e.g., sport, business,
health) and is trained from a large text corpus (Tk) relevant to this domain. All of these text corpora
can be easily obtained from the Internet (e.g., the resources from the Sogou Labs). Moreover, we have a
language model (LM0) for the general domain, which is also used in the first-pass recognition. In total, we
have K + 1 language models.
Since the perplexity measures how well an LM fits a given text, we can straightforwardly choose the LM with
the minimum perplexity from the pre-defined multi-domain LM set (including the generic LM):
\[ k^* = \arg\min_{k} PP_k(C), \quad 0 \le k \le K, \tag{4} \]
where $PP_k(C)$ is the perplexity of the $k$-th language model (LM$_k$) on the first-pass transcript $C$. According
to the definition of perplexity in Eq. (3), this criterion is equivalent to maximizing the log-likelihood of $C$. This
method works under the assumption that each document belongs to a single domain that matches one
pre-defined LM well.
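Eq. (4) amounts to an arg-min over the K + 1 perplexities. In the sketch below, each language model is represented by a hypothetical callable that returns its perplexity on the transcript; index 0 plays the role of the generic LM0.

```python
def select_model(transcript, perplexity_fns):
    """Return (k*, perplexities), where k* minimizes PP_k(C) as in Eq. (4)."""
    scores = [pp(transcript) for pp in perplexity_fns]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy usage with constant perplexities (numbers are made up).
models = [lambda c: 120.0, lambda c: 80.0, lambda c: 95.0]
best_index, all_scores = select_model("first-pass transcript", models)
```

Because the generic model is included in the candidate set, the selection can fall back to LM0 when no domain model fits better.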
5.2. Model Combination
Model selection straightforwardly picks one model from the pre-defined multi-domain LM set, but
such an LM set cannot cover all possible domains in practice. When the domain of the handwritten text is
a hybrid of several domains or lies outside the pre-defined set, this method will fail. To overcome this, we
generate a new adaptive LM via a linear combination of the pre-defined multiple LMs:
\[ p(\omega_n|\omega_1^{n-1}) = \sum_{k=0}^{K} \beta_k \cdot p_k(\omega_n|\omega_1^{n-1}), \tag{5} \]
where $p_k(\cdot)$ is calculated by the $k$-th language model (LM$_k$), and the parameter $\beta_k$ serves as the combination
weight that controls the relative importance of LM$_k$. To reduce the computational cost, we can consider a
reduced number of LMs with lower perplexities while viewing the remaining LMs as irrelevant to the test
document. Given the pre-defined multiple LMs, the key issue of model combination is to estimate the
combination weights to match the test document, and we propose three estimation methods in the following.
We first introduce a heuristic method to estimate these weights. The perplexity in Eq. (3) measures how
well the corresponding LM fits the transcript $C$, and a lower perplexity means a better model.
Therefore, the weight of one LM should be proportional to the reciprocal of its perplexity:
\[ \beta_k \propto \frac{1}{PP_k(C)}, \quad k = 0, 1, \ldots, K. \tag{6} \]
In this way, we can get the weights by normalizing the reciprocals of perplexity so that they sum to one:
\[ \beta_k = \frac{1/PP_k(C)}{\sum_{i=0}^{K} 1/PP_i(C)}, \quad k = 0, 1, \ldots, K. \tag{7} \]
Like the model selection method in Eq. (4), this method only needs to calculate the perplexity of each LM,
which is fast to implement. Actually, it is similar to the weight estimation method via string probability
in [25], where $P(C)$ was used instead of $1/PP(C)$ in Eq. (7). Unlike the perplexity $PP(C)$, the value of $P(C)$
(see Table 1) is not normalized with respect to the length of the sequence, and thus is very sensitive to the string
length.
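The contrast between the two weighting schemes can be made concrete with a small sketch. The first function implements Eq. (7); the second mimics weighting by string probability, using the relation $P_k(C) = PP_k(C)^{-m}$ that follows from Eq. (3). Both function names and the example perplexities are invented for illustration.

```python
import math


def reciprocal_perplexity_weights(perplexities):
    """Eq. (7): weights proportional to 1 / PP_k(C), normalized to sum to one."""
    inverse = [1.0 / pp for pp in perplexities]
    total = sum(inverse)
    return [v / total for v in inverse]


def string_probability_weights(perplexities, m):
    """Weights proportional to P_k(C) = PP_k(C) ** (-m) for a text of length m."""
    logs = [-m * math.log(pp) for pp in perplexities]
    top = max(logs)                                  # normalize in the log domain
    unnormalized = [math.exp(v - top) for v in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


pps = [100.0, 50.0, 200.0]                           # made-up perplexities
w_pp = reciprocal_perplexity_weights(pps)            # 2/7, 4/7, 1/7
w_short = string_probability_weights(pps, m=1)       # identical to w_pp
w_long = string_probability_weights(pps, m=200)      # nearly one-hot
```

With 200 characters, the string-probability weights collapse onto the single best model, whereas the perplexity-based weights do not change with the text length.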
Alternatively, it is possible to learn these weights using machine learning algorithms by viewing this
problem as a linear regression model. Given the first-pass transcript $C$, we extract all the n-grams in
$C$ to form a set of training samples. Each sample is an n-gram ($<\omega_1, \ldots, \omega_n>$), and its features are
the corresponding $K + 1$ LM probabilities ($x = (x_0, x_1, \ldots, x_K)^T$, where $x_k = p_k(\omega_n|\omega_1^{n-1})$ is given by the $k$-th LM,
$k = 0, 1, \ldots, K$). For each sample, we set the target probability of the n-gram as $t = 1$, in the hope that such an
n-gram will be chosen in the second-pass recognition (the path evaluation function tends to choose the n-gram
with higher probability). Meanwhile, an estimated value is given by $y = b^T x = \sum_{k=0}^{K} \beta_k x_k$ according to Eq. (5),
where $b = (\beta_0, \beta_1, \ldots, \beta_K)^T$. After obtaining the training samples of the test document, the weights can be
learned by minimizing the sum of squared error (MSE):
\[ \min_{b} F(b) = \sum_{i=1}^{N} (t_i - b^T x_i)^2, \tag{8} \]
where $N$ is the sample number, and the vector $x_i = (x_{i0}, x_{i1}, \ldots, x_{iK})^T$ denotes the features of the $i$-th sample.
To alleviate over-fitting, a regularization term is usually added to Eq. (8) to constrain the parameters,
leading to a modified error function:
\[ \min_{b} F(b) = \sum_{i=1}^{N} (t_i - b^T x_i)^2 + \lambda \cdot \|b\|_2^2, \tag{9} \]
where the hyper-parameter $\lambda$ ($\lambda \ge 0$) governs the relative importance of the regularization term. In this
formulation, the computation of $b$ is a quadratic programming problem, and has a closed-form solution:
\[ b = \Big( \sum_{i=1}^{N} x_i x_i^T + \lambda I \Big)^{-1} \sum_{i=1}^{N} t_i x_i, \tag{10} \]
where $I$ is an identity matrix. For the tradeoff parameter $\lambda$, considering the influence of data scaling, we
suggest setting $\lambda$ as
\[ \lambda = \frac{\tilde{\lambda}}{K + 1} \sum_{j=1}^{K+1} |M_{jj}|, \tag{11} \]
where $M = \sum_{i=1}^{N} x_i x_i^T$, and $\tilde{\lambda}$ is selected on a validation data set.
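Eqs. (10) and (11) translate almost directly into NumPy. The sketch below assumes the features are already collected in an N x (K+1) array with one row per n-gram of the transcript; the function name and the toy numbers are illustrative only. It does not constrain the weights to be non-negative or to sum to one, since the text states no such constraint.

```python
import numpy as np


def ridge_combination_weights(X, lam_tilde):
    """Closed-form ridge solution of Eq. (10) with lambda scaled as in Eq. (11).

    X[i, k] = p_k(n-gram i), the probability of the i-th n-gram under LM_k.
    All targets are t_i = 1, as in the text.
    """
    X = np.asarray(X, dtype=float)
    t = np.ones(X.shape[0])
    M = X.T @ X                                      # sum_i x_i x_i^T
    lam = lam_tilde * np.abs(np.diag(M)).mean()      # Eq. (11)
    b = np.linalg.solve(M + lam * np.eye(M.shape[0]), X.T @ t)   # Eq. (10)
    return b, lam


# Toy data: two n-grams, two language models (numbers are made up).
X_toy = [[0.5, 0.0],
         [0.0, 0.25]]
b_plain, lam_plain = ridge_combination_weights(X_toy, lam_tilde=0.0)
b_ridge, lam_ridge = ridge_combination_weights(X_toy, lam_tilde=0.1)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse; since the system has only K + 1 unknowns, the cost is negligible.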
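The regularization term in Eq. (9) is the L2-norm penalty, and such a model is also known as ridge
regression. Alternatively, we also evaluate the L1-norm regularization for model sparsity:
\[ \min_{b} F(b) = \sum_{i=1}^{N} (t_i - b^T x_i)^2 + \lambda \cdot \|b\|_1. \tag{12} \]
This model is also known as Lasso regression. It can be solved by the coordinate-wise descent
algorithm [43]:
\[ \beta_k(\lambda) \leftarrow S\Big( \beta_k(\lambda) + \sum_{i=1}^{N} x_{ik}(t_i - y_i), \lambda \Big), \tag{13} \]
where $y_i = b^T x_i$, and $S(\cdot)$ is a soft-thresholding function defined by
\[ S(\tilde{\beta}, \lambda) = \begin{cases} \tilde{\beta} - \lambda, & \text{if } \tilde{\beta} > 0 \text{ and } \lambda < |\tilde{\beta}|, \\ \tilde{\beta} + \lambda, & \text{if } \tilde{\beta} < 0 \text{ and } \lambda < |\tilde{\beta}|, \\ 0, & \text{if } \lambda \ge |\tilde{\beta}|. \end{cases} \tag{14} \]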
The update Eq. (13) is repeated for k = 0, 1, . . . ,K, 0, 1, . . . until convergence, and each update is very
quick in our experiments, because the sample number (N, i.e., the number of n-grams) is usually small
in one document. Compared to the L2-norm regularization, the L1-norm usually yields a sparse solution
(some learned weights are zero). This means that fewer language models are combined, which makes the
system faster.
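A coordinate-descent solver for Eq. (12) can be sketched in pure Python. One caveat: the update in Eq. (13) has the form that applies when each feature column has unit squared norm. The sketch below instead uses the general coordinate update, dividing by $\sum_i x_{ik}^2$ and thresholding at $\lambda/2$, which minimizes the objective of Eq. (12) exactly as written. That rescaling is an assumption of this example, not something stated in the text.

```python
def soft_threshold(value, threshold):
    """S(value, threshold) as defined in Eq. (14)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_combination_weights(X, t, lam, n_sweeps=100, tol=1e-10):
    """Minimize sum_i (t_i - b^T x_i)^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n_features = len(X[0])
    beta = [0.0] * n_features
    col_sq = [sum(row[k] ** 2 for row in X) for k in range(n_features)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for k in range(n_features):                  # k = 0, 1, ..., K, then repeat
            if col_sq[k] == 0.0:
                continue                             # feature is always zero: keep beta_k = 0
            rho = 0.0
            for row, target in zip(X, t):
                y = sum(b * x for b, x in zip(beta, row))
                # residual with the k-th contribution added back
                rho += row[k] * (target - y + beta[k] * row[k])
            new_beta = soft_threshold(rho, lam / 2.0) / col_sq[k]
            max_change = max(max_change, abs(new_beta - beta[k]))
            beta[k] = new_beta
        if max_change < tol:
            break
    return beta


# Toy checks (numbers are made up): one feature, then a large penalty.
w_small = lasso_combination_weights([[1.0], [1.0]], [1.0, 1.0], lam=1.0)
w_large = lasso_combination_weights([[1.0], [1.0]], [1.0, 1.0], lam=10.0)
w_two = lasso_combination_weights([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], lam=1.0)
```

In the one-feature case the objective is $2(1-\beta)^2 + |\beta|$, whose minimizer is $\beta = 0.75$; with the larger penalty the weight is driven exactly to zero, which is the sparsity effect described above.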
5.3. Model Reconstruction
Viewing one pre-defined LM as a sample, the above methods of LM adaptation by either selecting one
sample or combining several samples can be seen as a joint representation for the LM set (sample set). In
this section, we propose another parametric form using a group of orthogonal bases to reconstruct an LM:
\[ s = \mu + U_r v_r. \tag{15} \]
This idea is motivated by the work in [22] on representing images of human faces, and it has been
successfully used in other areas such as active shape models [23].
In the above, the term $\mathbf{s} \in \Re^{d}$ is a reconstructed sample, denoting a new LM (adaptive LM), with each
element representing the probability value of an n-gram, and the dimensionality d is the size of the n-gram
list shared in the orthogonal space. The term $\boldsymbol{\mu} = \frac{1}{N_s} \sum_{i=1}^{N_s} \mathbf{s}_i$ is the sample mean, and $N_s$ is the sample
number (herein, the number of pre-defined LMs, i.e., $N_s = K + 1$). The r columns of the matrix $\mathbf{U}_r \in \Re^{d \times r}$
are the orthogonal bases obtained by applying principal component analysis (PCA) to the sample set, and
$\mathbf{v}_r \in \Re^{r}$ denotes the coefficients of the sample's projection onto the orthogonal space.
The computation of the orthogonal bases follows conventional PCA; that is, the r
columns of the matrix $\mathbf{U}_r$ are the first r eigenvectors of the sample covariance matrix $\Sigma = \frac{1}{K+1} \bar{S} \bar{S}^T$, where
$\bar{S} = [\mathbf{s}_0 - \boldsymbol{\mu}, \ldots, \mathbf{s}_K - \boldsymbol{\mu}]$. One might wonder how to efficiently obtain the eigenvectors of the large
matrix $\bar{S} \bar{S}^T \in \Re^{d \times d}$ (d is on the order of millions). Fortunately, the matrix
$\bar{S}^T \bar{S} \in \Re^{(K+1) \times (K+1)}$ is very small (K + 1 is on the order of dozens), and its eigenvectors can be calculated quickly.
According to Eqs. (16)-(18) below, an eigenvector x of $\bar{S}^T \bar{S}$ corresponding to eigenvalue ξ can
be easily transformed into the required eigenvector u of $\bar{S} \bar{S}^T$ by $\mathbf{u} = \frac{1}{\sqrt{\xi}} \bar{S} \mathbf{x}$ (the factor $1/\sqrt{\xi}$ normalizes u to unit length).
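$$\bar{S}^T \bar{S} \mathbf{x} = \xi \mathbf{x}, \tag{16}$$
$$\bar{S} (\bar{S}^T \bar{S}) \mathbf{x} = \bar{S} (\xi \mathbf{x}), \tag{17}$$
$$(\bar{S} \bar{S}^T)(\bar{S} \mathbf{x}) = \xi (\bar{S} \mathbf{x}). \tag{18}$$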
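The eigenvector transformation of Eqs. (16)-(18) can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: the function name is hypothetical, S is assumed to be a dense d × (K + 1) array with one LM per column, and r is restricted to at most K because the centered matrix has rank at most K.

```python
import numpy as np

def pca_bases_via_gram(S, r):
    """Return (mu, U_r) for a d x (K+1) matrix S, using the small Gram matrix."""
    n_samples = S.shape[1]
    if not 1 <= r < n_samples:
        raise ValueError("r must satisfy 1 <= r <= K")
    mu = S.mean(axis=1, keepdims=True)      # sample mean, shape (d, 1)
    S_bar = S - mu                          # centered samples
    gram = S_bar.T @ S_bar                  # (K+1) x (K+1) instead of d x d
    xi, X = np.linalg.eigh(gram)            # eigenvalues in ascending order
    order = np.argsort(xi)[::-1][:r]        # keep the r largest
    xi, X = xi[order], X[:, order]
    U_r = (S_bar @ X) / np.sqrt(xi)         # u = S_bar x / sqrt(xi)
    return mu[:, 0], U_r
```

The columns of `U_r` are orthonormal, and only a (K + 1) × (K + 1) eigenproblem is solved, so the d × d covariance matrix is never formed.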
Now we focus on how to construct a new sample (i.e., adaptive LM) in such orthogonal space to match
the test document. Like the linear regression model in Section 5.2, we get adaptive coefficients vr by
minimizing the sum of squared error with L2-norm regularization:
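$$\min_{\mathbf{v}_r} F(\mathbf{v}_r) = \sum_{i=1}^{N} \left[ t_i - (\mu_{\pi(i)} + \mathbf{u}_{\pi(i)}^T \mathbf{v}_r) \right]^2 + \lambda \cdot \|\mathbf{v}_r - \mathbf{v}_0\|_2^2, \tag{19}$$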
where N is the number of n-grams in the first-pass transcript C, and $t_i = 1$ is the target value for the i-th
n-gram. The subscript π(i) is the index of the i-th n-gram of C in the shared n-gram list^3, and $\mathbf{u}_{\pi(i)}^T$ is
the π(i)-th row of the bases matrix $\mathbf{U}_r$. In the regularization term, $\mathbf{v}_0$ represents the projection of the
sample $\mathbf{s}_0$ onto the orthogonal space: $\mathbf{v}_0 = \mathbf{U}_r^T (\mathbf{s}_0 - \boldsymbol{\mu})$, where $\mathbf{s}_0$ is the language model obtained by the model selection
method. This means that we hope the reconstructed LM is attracted to the vicinity of the model selection
result. Similar to Eq. (10), we also have a closed-form solution:
$$\mathbf{v}_r = \left( \sum_{i=1}^{N} \mathbf{u}_{\pi(i)} \mathbf{u}_{\pi(i)}^T + \lambda \mathbf{I} \right)^{-1} \left( \sum_{i=1}^{N} (t_i - \mu_{\pi(i)}) \mathbf{u}_{\pi(i)} + \lambda \mathbf{v}_0 \right), \tag{20}$$
and the tradeoff parameter λ is also adjusted by considering the scaling of different handwritten texts:
$$\lambda = \frac{\tilde{\lambda}}{r} \sum_{j=1}^{r} |M_{jj}|, \tag{21}$$
where $M = \sum_{i=1}^{N} \mathbf{u}_{\pi(i)} \mathbf{u}_{\pi(i)}^T$, and λ̃ is selected on a validation data set. Since there is no sparse property in the
coefficients in $\mathbf{v}_r$, we do not try L1-norm regularization here.
^3 If an n-gram is not in the list, it is not used, so the number N is usually smaller than the length of the transcript C here.
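The closed-form solution of Eq. (20) can be sketched in NumPy as below. This is a hedged illustration, not the authors' implementation: the function names and the argument `idx` (an integer array holding π(i) for the N in-list n-grams) are hypothetical, and the targets are fixed to t_i = 1 as stated in the text.

```python
import numpy as np

def reconstruct_coefficients(U_r, mu, idx, v0, lam_tilde):
    """Solve Eq. (20) for v_r, with lambda scaled as in Eq. (21).

    U_r: (d, r) bases; mu: (d,) sample mean; idx: (N,) indices pi(i);
    v0: (r,) projection of the model-selection result.
    """
    U_obs = U_r[idx]                                  # row i is u_{pi(i)}^T
    M = U_obs.T @ U_obs                               # sum_i u u^T
    lam = lam_tilde * np.mean(np.abs(np.diag(M)))     # Eq. (21)
    rhs = U_obs.T @ (1.0 - mu[idx]) + lam * v0        # t_i = 1 for every n-gram
    return np.linalg.solve(M + lam * np.eye(U_r.shape[1]), rhs)

def adaptive_lm(U_r, mu, v_r):
    """Reconstruct the adaptive LM vector by Eq. (15)."""
    return mu + U_r @ v_r
```

As λ̃ grows, the solution approaches v0, i.e., the reconstructed LM falls back to the model selection result, which matches the role the text gives to the regularization term.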
Compared to the above LMA methods of either model selection or model combination, model reconstruction
by the orthogonal bases can generate a more flexible LM. However, it needs a longer handwritten
text to estimate the optimized coefficients for an adaptive LM, and is limited to the shared n-gram list.
6. Language Model Compression
The above LMA methods depend on a LM set including K+1 LMs, which poses a challenge of storage.
Although we pruned each LM to a moderate size using entropy-pruning [18], the storage size of K + 1
models is still considerable for practical applications. Each LM contains two parts, namely, the n-gram
table and the probability values of each n-gram. The n-gram table is not considered in the following,
because it is fixed in every LM. In this section, we introduce two methods to compress the storage size of
probability values without removing any n-gram. The first method uses split vector quantization (SVQ) to
compress each LM separately, while the other one uses PCA to compress the whole LM set jointly.
6.1. SVQ Compression
In each language model, there are many similar probability values, such as the probabilities of similar
n-grams. These probabilities can be clustered to a group of prototypes with little loss of precision. This
motivates us to use the SVQ technique [36] to compress each LM. Figure 4 shows the diagram of this
method for one vector formed by the probability values of all n-grams.
Figure 4: Diagram of split vector quantization (SVQ) compression.
According to the back-off representation of an n-gram model in Eq. (2), the model parameters can be
represented by a vector set $\phi = \{\mathbf{p}^*_n, \mathbf{p}^*_{n-1}, \boldsymbol{\alpha}_{n-1}, \ldots, \mathbf{p}^*_1, \boldsymbol{\alpha}_1\}$, where each vector represents one group of parameters
(e.g., $\mathbf{p}^*_{n-1}$ represents the probabilities of (n − 1)-grams and $\boldsymbol{\alpha}_{n-1}$ represents the scaling factors of (n − 1)-grams)
and can be compressed by SVQ. We take the compression of the vector $\mathbf{p}^*_n$ as an example; the other
vectors are compressed similarly. We first split the original high-dimensional vector into low-dimensional
sub-vectors, i.e., the original d-dimensional vector $\mathbf{p}^*_n$ is equally partitioned into Q sub-vectors
of dimensionality $d_Q$, $(\mathbf{p}^*_{n1}, \mathbf{p}^*_{n2}, \ldots, \mathbf{p}^*_{nQ})$, where $d = Q \times d_Q$^4. Then, these sub-vectors are clustered into a
small set of L prototypes by the k-means clustering algorithm to form a codebook. The codebook as well
as the corresponding indices of prototypes for all sub-vectors are stored for LM reconstruction. We can see
that the data quantization method in [19] is a special case of SVQ compression with $d_Q = 1$.
During the recognition process, the probability of one n-gram in a sub-vector is mapped by the corre-
sponding element of the prototype according to the index of the sub-vector. Lower mapping error can be
ensured by using more prototypes (L) and lower dimensionality of the subspace (dQ), which will lead to a
larger storage size, however. In our experiments, we found that dQ = 2 and L = 256 for the compression of
each LM leads to a good tradeoff between the size and recognition performance.
^4 When d is not an integer multiple of $d_Q$, dummy elements can be added to make it one.
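The SVQ procedure described above (split, cluster with k-means, store codebook and indices) can be sketched as follows. This is a small illustration under stated assumptions, not the implementation of [36] or of the authors: the function names are hypothetical, dummy elements are taken to be zeros, k-means is a plain Lloyd iteration with random initialization, and the dense distance matrix is only practical for short vectors.

```python
import numpy as np

def svq_compress(p, d_q=2, n_prototypes=256, n_iter=20, seed=0):
    """Return (codebook, indices, original_length) for a probability vector p."""
    p = np.asarray(p, dtype=float)
    pad = (-len(p)) % d_q                                # dummy elements (zeros here)
    sub = np.concatenate([p, np.zeros(pad)]).reshape(-1, d_q)   # Q sub-vectors
    n_codes = min(n_prototypes, len(sub))
    rng = np.random.default_rng(seed)
    codebook = sub[rng.choice(len(sub), size=n_codes, replace=False)].copy()
    for _ in range(n_iter):
        # squared distance of every sub-vector to every prototype, shape (Q, L)
        dist = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        indices = dist.argmin(axis=1)
        for j in range(n_codes):
            members = sub[indices == j]
            if len(members):                             # empty clusters keep their prototype
                codebook[j] = members.mean(axis=0)
    dist = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    indices = dist.argmin(axis=1).astype(np.uint8 if n_codes <= 256 else np.int64)
    return codebook, indices, len(p)

def svq_decompress(codebook, indices, length):
    """Map each sub-vector index back to its prototype and drop the padding."""
    return codebook[indices].reshape(-1)[:length]
```

With L = 256 each index fits in one byte, so two probabilities (d_Q = 2) are stored in a single byte plus a shared codebook of 256 × 2 values.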
6.2. PCA Compression
We represent multiple LMs as a data matrix S = [s0, s1, . . . , sK], where the k-th column (sk ∈ ℜd)
holds the n-gram probabilities of the k-th LM on a shared n-gram list. The shared list is constructed
from all text corpora, and its size is denoted by d (i.e., the number of all n-grams). Typically, K and d
are on the order of dozens and millions, respectively. In such a matrix, there are many correlated
elements, e.g., the values of common n-grams in different LMs or similar n-grams in one LM. We use the
PCA technique to remove such correlations and compress the redundancy. Figure 5 shows the diagram of
the PCA compression.
Figure 5: Diagram of principal component analysis (PCA) compression.
A data vector s can be projected onto an orthogonal space of r-dimensionality by PCA:
$$\mathbf{v}_r = \mathbf{U}_r^T (\mathbf{s} - \boldsymbol{\mu}), \tag{22}$$
where the projected vector vr denotes the coefficients on the first r principal components, µ is the sample
mean vector, and Ur is the same bases matrix as in Section 5.3. Given the coefficient vector vr, the
original data vector can be easily approximated according to Eq. (15). Finally, all the K + 1 samples can be
approximated by the corresponding K + 1 coefficient vectors, r basis vectors and one sample mean vector,
as illustrated in Fig. 5.
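As a rough illustration of the saving, the number of stored values can be counted directly. The count below is a sketch that assumes the uncompressed set is stored as the full d × (K + 1) matrix S over the shared list; it covers only the probability values, not the n-gram table. The example figures K + 1 = 16 and r = 3 are taken from the character-level setting (16 LMs in Table 3, r = 3 for cti in Table 2).

```latex
\underbrace{(K+1)\,d}_{\text{matrix } S}
\;\longrightarrow\;
\underbrace{(K+1)\,r}_{\text{coefficients}} + \underbrace{r\,d}_{\text{bases}} + \underbrace{d}_{\text{mean}},
\qquad
\frac{(K+1)\,r + (r+1)\,d}{(K+1)\,d} \approx \frac{r+1}{K+1} \quad (d \gg K).
% Example: K + 1 = 16 and r = 3 give (r+1)/(K+1) = 4/16 = 25\% of the values in S.
```

Because each LM presumably stores only its own n-grams rather than the whole shared list, the saving observed in practice can be expected to be smaller than this idealized ratio.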
Note this PCA compression does not work for word class bi-gram (wcb) models, because it is impossible
to construct the shared list of word class bi-gram from all text corpora (the class index represents different
words in different wcb models).
Moreover, we can compress each vector (e.g., mean vector, eigenvector) further by the SVQ method,
which gives a combination of PCA and SVQ compression to produce a much smaller storage size.
7. Experimental Results
We evaluated the performance of our unsupervised LMA approaches on two databases: a large database
of unconstrained Chinese handwriting, CASIA-HWDB [20], and a small data set, HIT-MW [21], both of which are
free to download for research^5. All the experiments were run on a desktop computer with a 3.10 GHz CPU,
and the programs were written in Microsoft Visual C++ 2005.
7.1. Database and Experimental Setting
The CASIA-HWDB database contains both isolated characters and unconstrained handwritten texts, and
is divided into a training set of 816 writers and a test set of 204 writers. The training set contains 3,118,447
isolated character samples of 7,356 classes and 4,076 pages of handwritten texts (including 1,080,017 character
samples). We tested on the 1,015 handwritten pages in the test set, which were segmented into 10,449
text lines containing 268,629 characters; that is, each page contains about 265 characters on average.
The HIT-MW data set has a test set of 383 text line images containing 8,448 characters (on average, 22
characters in each text line), and each text line is treated as a handwritten text page in this set.
A typical HCTR system comprises a character classifier, geometric models and a language
model, together with the combination weights of these models. In our system, we use a modified quadratic
discriminant function (MQDF) [44] as the character classifier on the normalization-cooperated gradient features
(NCGF) [45] extracted from each gray-scale character image. The parameters of MQDF were learned
from 4/5 of the training samples (including the isolated samples and the character samples segmented from
the text pages), and the remaining 1/5 of the training samples were used for confidence parameter estimation. For
parameter estimation of the geometric models, we extracted geometric features from all the 41,781 text lines
of the training text pages. The generic language models were trained on a large general text corpus containing
about 50.8 million characters (about 32.7 million words) from the Chinese Linguistic Data Consortium.
After obtaining these context models, the combining weights were learned on 300 training text pages. These
settings are the same as those in our previous work [4]. Another 200 training text pages were used as
held-out data to set the hyper-parameters, e.g., the regularization term weight λ̃ in Eq. (11) and the number of
principal components r in PCA. The obtained parameter values are listed in Table 2, where the value
in parentheses in the last row is the weight λ6 in the iwc model (see Table 1). Because PCA does not
work for wcb, as explained in Section 6.2, the corresponding values are denoted as 'N/A' in Table 2.
^5 http://www.nlpr.ia.ac.cn/databases/handwriting/Home.html and http://code.google.com/p/hit-mw-database/.
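Table 2: Parameter values obtained by MCA training [4] or cross-validation. λ1 to λ5 are the context weights in Eq. (1), λ̃ is the regularization term weight, and r is the number of principal components.
| Model | λ1 | λ2 | λ3 | λ4 | λ5 | λ̃ (L2) | λ̃ (L1) | λ̃ (PCA) | r |
|---|---|---|---|---|---|---|---|---|---|
| cbi | 0.021 | 0.088 | 0.151 | 0.073 | 0.184 | 1.0 | 0.10 | 15 | 6 |
| cti | 0.021 | 0.084 | 0.147 | 0.084 | 0.186 | 0.3 | 0.10 | 10 | 3 |
| wbi | 0.031 | 0.011 | 0.178 | 0.160 | 0.163 | 1.0 | 0.15 | 10 | 3 |
| wcb | 0.030 | 0.010 | 0.170 | 0.163 | 0.154 | 4.0 | 0.70 | N/A | N/A |
| iwc | 0.034 | 0.013 | 0.177 | 0.127 | 0.085 (0.083) | 1.0 | 0.15 | 10 | 3 |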
To prepare a pre-defined set of LMs to match different handwritten texts, we extracted 14 corpora of
various domains from the web pages provided by the Sogou Labs. All the texts were segmented into word
sequences using the ICTCLAS^6 toolkit for word-level LMs, and further, we clustered the words into a
number of word classes using the algorithm of [46]. In addition, an ancient domain corpus (without word
segmentation due to the unavailability of ancient domain word lexicon) was collected from the Internet.
Table 3 shows the statistics of characters, words, character classes, word classes and word clusters in each
corpus. We can see that the corpus of news domain is the largest, which includes about 418 million charac-
ters and 265 million words, and is even much larger than the general one. On the other hand, the texts of
ancient domain are much fewer, but about 8.22 million characters are enough to get an appropriate character
bi-gram or tri-gram model using the SRILM [7] toolkit.
We evaluate the recognition performance using two character-level accuracy metrics as in our previous
paper [4]: Correct Rate (CR) and Accurate Rate (AR):
$$CR = (N_t - D_e - S_e)/N_t, \qquad AR = (N_t - D_e - S_e - I_e)/N_t, \tag{23}$$
where Nt is the total number of characters in the ground truth transcript. The numbers of substitution errors
(Se), deletion errors (De) and insertion errors (Ie) are calculated by aligning the recognized result string with
the ground truth transcript by dynamic programming. Vinciarelli et al. [39] suggested that the AR (called
recognition rate there) is an appropriate measure for document transcription, while CR (called accuracy rate
there) is a good metric for tasks of content modeling (e.g., document retrieval).
^6 Institute of Computing Technology, Chinese Lexical Analysis System: http://ictclas.org/
Table 3: Statistics of characters, words, character classes, word classes and word clusters in each corpus.
| Domain | LM | Characters (million) | Words (million) | Character classes | Word classes | Word clusters |
|---|---|---|---|---|---|---|
| general | LM0 | 50.8 | 32.7 | 7356 | 281,680 | 1000 |
| news | LM1 | 418 | 265 | 6699 | 454,370 | 1000 |
| business | LM2 | 333 | 202 | 6474 | 473,792 | 1000 |
| sport | LM3 | 227 | 149 | 6789 | 234,130 | 1000 |
| house | LM4 | 118 | 73.4 | 6231 | 254,659 | 1000 |
| entertain | LM5 | 106 | 71.5 | 5882 | 144,246 | 500 |
| it | LM6 | 54.1 | 33.1 | 5628 | 156,728 | 500 |
| Olympic | LM7 | 52.2 | 33.0 | 6048 | 144,390 | 500 |
| women | LM8 | 44.0 | 29.5 | 5569 | 94,409 | 350 |
| auto | LM9 | 32.4 | 20.4 | 5153 | 105,487 | 500 |
| travel | LM10 | 31.4 | 20.1 | 5755 | 133,731 | 500 |
| health | LM11 | 31.2 | 20.2 | 5590 | 99,207 | 350 |
| learning | LM12 | 28.0 | 17.5 | 5548 | 104,282 | 350 |
| culture | LM13 | 20.0 | 13.4 | 5791 | 104,162 | 350 |
| military | LM14 | 15.3 | 9.47 | 4854 | 69,683 | 250 |
| ancient | LM15 | 8.22 | - | 7318 | - | - |
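The CR and AR of Eq. (23) can be computed with a standard edit-distance alignment, sketched below. This is an illustration under stated assumptions: the function names are hypothetical, all three error types have unit cost, and ties between equally cheap alignments are broken arbitrarily (here in favor of fewer substitutions), which the paper does not specify.

```python
def count_errors(truth, recognized):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    n, m = len(truth), len(recognized)
    # cost[i][j] = (total, S, D, I) for aligning truth[:i] with recognized[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)            # everything deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)            # everything inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if truth[i - 1] == recognized[j - 1]:
                best = (t, s, d, ins)        # match, no error
            else:
                best = (t + 1, s + 1, d, ins)
            t, s, d, ins = cost[i - 1][j]
            best = min(best, (t + 1, s, d + 1, ins))
            t, s, d, ins = cost[i][j - 1]
            best = min(best, (t + 1, s, d, ins + 1))
            cost[i][j] = best
    _, s, d, ins = cost[n][m]
    return s, d, ins

def correct_and_accurate_rate(truth, recognized):
    """Return (CR, AR) as defined in Eq. (23)."""
    s, d, ins = count_errors(truth, recognized)
    n_t = len(truth)
    return (n_t - d - s) / n_t, (n_t - d - s - ins) / n_t
```

For a five-character ground truth recognized with one spurious extra character, CR stays at 1.0 while AR drops to 0.8, which shows why AR is the stricter of the two measures.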
7.2. Baseline System Performance
To show the effectiveness of unsupervised LMA in HCTR, we first give the results of our baseline HCTR
system which does not use any LMA techniques. In this system, the maximum number of concatenated
segments, candidate number of character classification and beam width in the refined beam search algorithm
(without CCA: candidate character augmentation) are set as 4, 20 and 10, respectively, and the results
of our beam search method were shown to be very close to the optimal solution guaranteed by dynamic
programming under the baseline path evaluation function [4]. To show the importance of various domain
texts, we also trained a generic LM from the union of all text corpora in Table 3, besides that only from the
general text corpus, and the recognition results of all LM types on the CASIA-HWDB test set are shown in
Table 4.
Table 4: Recognition results of baseline system without LMA on CASIA-HWDB test set.
| LMs | General corpus CR(%) | General corpus AR(%) | General corpus Time(h) | All corpora CR(%) | All corpora AR(%) | All corpora Time(h) |
|---|---|---|---|---|---|---|
| w/o | 79.43 | 77.34 | 0.04 (9.72) | | | |
| cbi | 90.28 | 89.57 | 0.08 | 90.52 | 89.78 | 0.08 |
| cti | 90.81 | 90.21 | 0.14 | 91.28 | 90.66 | 0.15 |
| wbi | 90.99 | 90.34 | 0.62 | 91.42 | 90.76 | 0.65 |
| wcb | 90.81 | 90.11 | 0.60 | 91.27 | 90.57 | 0.65 |
| iwc | 91.22 | 90.58 | 0.65 | 91.63 | 90.98 | 0.67 |
From Table 4, we can observe three points. First, the LM plays an important role in the HCTR system,
improving the recognition accuracy by more than 10 percentage points ("w/o" denotes the results without any LM).
Second, the recognition time on the 1,015 test pages is reduced drastically from 9.72h to 0.04h when the
times of over-segmentation and character classification are excluded, indicating that our path search process is very
efficient. Hence, by storing all character classification results (about 200KB memory) after the first-pass
recognition, we can make our two-pass LMA framework incur little additional computational
cost. To show the effects of different language models, we only report the time of the path search part in
the following experiments. Last, the LM trained from the union of all text corpora yields higher recognition
accuracy than that trained only from the general corpus, though it is much larger due to the increased number
of n-grams. In the following, the baseline results refer to those of the generic LM trained only from the general text
corpus by default. We will see that LMA leads to higher performance than the generic models from both
the general text corpus and the union of all corpora.
7.3. Results of Language Model Adaptation
In this section, we first show the results of three unsupervised LMA methods of model selection, model
combination and model reconstruction, evaluated on the CASIA-HWDB test set. Because word-level LMs
are not available for ancient domain, we use the adaptive cti model instead of wbi, wcb and iwc model in
the second-pass recognition. In the end, we also report results on the HIT-MW test set using selected LMA
methods.
7.3.1. Results of LMA using Model Selection
Table 5 shows the results of LMA using model selection. Compared to the baseline results in Table 4,
both CR and AR are improved by the model selection adaptation for all LM types, and the improvement of
cti is the largest (about 1.1 percentage points). Even compared to the baseline results of the generic LM trained
from all text corpora, the improvement by model selection adaptation is impressive. This demonstrates
the importance of a domain-matched LM in the recognition of variable documents. On the other hand, the
processing time (path search part) is about doubled due to the two-pass recognition strategy.
Table 5: Results of LMA using model selection on CASIA-HWDB test set.
| LMs | CR(%) | AR(%) | Time(h) |
|---|---|---|---|
| cbi | 91.26 | 90.59 | 0.17 |
| cti | 91.92 | 91.37 | 0.26 |
| wbi | 91.91 | 91.30 | 1.03 |
| wcb | 91.73 | 91.06 | 1.14 |
| iwc | 92.14 | 91.53 | 1.21 |
7.3.2. Results of LMA using Model Combination
In Section 5.2, we introduced three methods to estimate the weights of model combination for LMA,
namely, heuristic method, minimizing squared error (MSE) with L2-norm and L1-norm regularization. The
recognition results on CASIA-HWDB test set are shown in Table 6.
Table 6: Results of LMA using model combination with all LMs on CASIA-HWDB test set.
| LMs | Heuristic CR(%) | Heuristic AR(%) | Heuristic Time(h) | MSE L2-norm CR(%) | MSE L2-norm AR(%) | MSE L2-norm Time(h) | MSE L1-norm CR(%) | MSE L1-norm AR(%) | MSE L1-norm Time(h) |
|---|---|---|---|---|---|---|---|---|---|
| cbi | 91.33 | 90.63 | 1.01 | 91.68 | 90.78 | 0.99 | 91.63 | 90.71 | 0.33 |
| cti | 92.20 | 91.62 | 2.50 | 92.42 | 91.72 | 2.53 | 92.32 | 91.64 | 0.80 |
| wbi | 92.23 | 91.63 | 1.43 | 92.18 | 91.40 | 1.46 | 92.11 | 91.36 | 1.21 |
| wcb | 91.82 | 91.12 | 1.54 | 91.70 | 90.82 | 1.68 | 91.67 | 90.89 | 1.30 |
| iwc | 92.28 | 91.67 | 2.01 | 92.22 | 91.41 | 2.01 | 92.24 | 91.45 | 1.54 |
First, compared to the results of model selection adaptation in Table 5, we can see that the recognition
accuracy is improved further by model combination, especially by the MSE method with L2-norm regularization
for character-level LMs (i.e., cbi and cti). This demonstrates the benefits of the proposed model
combination method, which improves CR from 90.81% to 92.42% and AR from 90.21% to 91.72% for cti relative to the baseline in Table 4.
Next, we compare the results of three weight learning methods for model combination. Table 6 shows
that the results of MSE methods are better than those of heuristic method for character-level LMs (i.e.,
cbi and cti), whereas for word-level LMs (i.e., wbi, wcb and iwc), the heuristic method performs slightly
better. This is because the character sequence of the first-pass recognition output is more reliable than word
sequence with possible word segmentation errors, and there are usually more characters than words in the
recognized text. Thus, learning weights from such short recognized text for character-level LMs is more
robust than that for word-level LMs. Moreover, we can see that the results of two regularization methods in
MSE combination are comparable, while the L1-norm regularization shows the benefit of less processing
time because it selects fewer LMs to combine, justifying the model sparsity property.
Finally, to speed up the MSE with L2-norm regularization, we also evaluate the performance of combin-
ing fewer LMs, which are selected according to the minimum perplexity criterion Eq. (4). Figure 6 shows
the results of different numbers of LMs used in the model combination by MSE with L2-norm for all LM
types. We can see that combining three LMs gives a good tradeoff between the recognition accuracy and
speed. Further, we show the results of combining three selected LMs using both the heuristic method and
the MSE with L2-norm in Table 7. We can see that the recognition accuracies of combining three selected
LMs are comparable to those of combining all LMs in Table 6, while the processing time (path search part)
is reduced significantly, especially for the character-level LMs (reduced from 2.53h to 0.49h for cti).
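The selection of the LMs to combine can be sketched as follows. Since Eq. (4) is not reproduced in this section, the sketch assumes that the minimum perplexity criterion uses the standard perplexity PP = exp(−(1/N) Σ log p_i) over the n-grams of the first-pass transcript; the function names and the dictionary layout are hypothetical.

```python
import math

def perplexity(log_probs):
    """Perplexity from natural-log probabilities of the transcript's n-grams."""
    return math.exp(-sum(log_probs) / len(log_probs))

def select_lowest_perplexity(log_probs_per_lm, n_select=3):
    """Return the names of the n_select LMs with lowest perplexity, plus all perplexities."""
    pps = {name: perplexity(lp) for name, lp in log_probs_per_lm.items()}
    chosen = sorted(pps, key=pps.get)[:n_select]
    return chosen, pps
```

Only the selected models then enter the weight estimation, so the regression of Eq. (10) involves three weights instead of K + 1.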
Figure 6: The results of combining different numbers of LMs by the MSE with L2-norm regularization on CASIA-HWDB test set (panels (a), (b) and (c)).
Table 7: Results of LMA using model combination with only three selected LMs on CASIA-HWDB test set.
| LMs | Heuristic CR(%) | Heuristic AR(%) | Heuristic Time(h) | MSE L2-norm CR(%) | MSE L2-norm AR(%) | MSE L2-norm Time(h) |
|---|---|---|---|---|---|---|
| cbi | 91.48 | 90.81 | 0.26 | 91.69 | 90.85 | 0.26 |
| cti | 92.27 | 91.71 | 0.51 | 92.37 | 91.73 | 0.49 |
| wbi | 92.23 | 91.63 | 1.18 | 92.24 | 91.51 | 1.13 |
| wcb | 91.90 | 91.22 | 1.23 | 91.83 | 91.05 | 1.26 |
| iwc | 92.35 | 91.75 | 1.30 | 92.34 | 91.60 | 1.31 |
7.3.3. Results of LMA using Model Reconstruction
Table 8 shows the results of LMA using model reconstruction, where the number of principal components
was empirically set as 6, 3 and 3 for cbi, cti and wbi, respectively. Because we do not have a
shared n-gram list to construct the orthogonal bases for wcb, this method does not work for wcb. In the
iwc model, only the wbi is reconstructed, and the wcb uses the model of minimum perplexity. Compared
to the baseline results in Table 4, we can see an evident improvement of recognition accuracy by model
reconstruction in HCTR. For the cti model, the CR is improved from 90.81% to 92.06%, and the AR is improved
from 90.21% to 91.49%. However, the performance of model reconstruction is inferior to that of the model
combination methods in Table 6, though it is slightly better than the performance of model selection in
Table 5.
Table 8: Results of LMA using model reconstruction on CASIA-HWDB test set.
| LMs | CR(%) | AR(%) | Time(h) |
|---|---|---|---|
| cbi | 91.39 | 90.71 | 0.23 |
| cti | 92.06 | 91.49 | 0.44 |
| wbi | 92.10 | 91.49 | 1.17 |
| iwc | 92.14 | 91.55 | 1.32 |
7.3.4. Performance on the HIT-MW Test Set
Finally, we show the recognition results of our LMA methods on the HIT-MW test set, where each
document (a text line image) contains only 22 characters on average. Following the results on the larger
CASIA-HWDB test set, the model combination method uses the three selected LMs of lowest perplexity,
with the weights estimated by the MSE with L2-norm regularization (character-level models) or the
heuristic method (word-level models). All experimental settings are the same as those on CASIA-HWDB, and
the recognition results are shown in Table 9. We can see that all the LMA methods improve the recognition
accuracy, and again model combination performs best (CR up by about 1.1 percentage points for cti),
demonstrating the benefits of LMA even on short texts.
To verify the reliability of the performance improvement, we give the results of statistical tests for pairwise
comparisons of methods. The Wilcoxon signed-ranks test is claimed to be usually more powerful than the
t-test [47]. It ranks the differences in performance of two methods on each document, ignoring the signs,
and compares the ranks of the positive and negative differences. Here, we have four methods with 383
evaluations (CR values of the 383 documents in the HIT-MW test set), and the results are shown in the 's-test'
columns of Table 9 by signs ('+': significant and '−': non-significant at the 0.05 significance level). Each
's-test' column shows the comparison results of that method with the methods in the columns to its left, in turn.
We can see that all improvements by the LMA methods over the baseline system are significant
(the first sign in each column is '+'); however, the differences among the three LMA methods are mostly not significant
(most second and third signs are '−'), due to the insufficient estimation from these very short texts in model
combination and reconstruction.
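The ranking procedure described above can be sketched with the Python standard library. This is an illustrative version of the Wilcoxon signed-ranks test under stated assumptions, not necessarily the exact variant behind Table 9: zero differences are dropped, tied absolute differences receive average ranks, and the two-sided p-value uses the normal approximation without a tie correction. The function name is hypothetical.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Return (z, two-sided p) for paired scores a and b (e.g., per-document CR)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]      # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                         # average ranks over ties
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)   # rank sum of positive differences
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2))                 # two-sided normal tail
    return z, p
```

A '+' in an 's-test' column would then correspond to p < 0.05 for the pair of methods being compared.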
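Table 9: Results of LMA on the HIT-MW test set.
| LMs | Baseline CR(%) | Baseline AR(%) | Selection CR(%) | Selection AR(%) | Selection s-test | Combination CR(%) | Combination AR(%) | Combination s-test | Reconstruction CR(%) | Reconstruction AR(%) | Reconstruction s-test |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cbi | 91.70 | 90.79 | 92.25 | 91.30 | + | 92.52 | 91.41 | + − | 92.28 | 91.39 | + − + |
| cti | 92.28 | 91.52 | 93.03 | 92.31 | + | 93.44 | 92.63 | + + | 93.22 | 92.47 | + − − |
| wbi | 92.46 | 91.50 | 93.03 | 92.10 | + | 93.37 | 92.48 | + + | 93.21 | 92.26 | + − − |
| wcb | 92.01 | 91.02 | 92.65 | 91.63 | + | 92.74 | 91.62 | + − | N/A | N/A | N/A |
| iwc | 92.52 | 91.57 | 93.13 | 92.23 | + | 93.32 | 92.41 | + − | 93.12 | 92.22 | + − − |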
7.4. Effects of Language Model Compression
The above experiments used the LMs without compression, which consume a large amount of storage,
especially for cti. Recall that in Section 6 we introduced LM compression methods including SVQ of
each LM separately, PCA reduction of the whole LM set, and the combination of the two (SVQ+PCA). In this
section, we evaluate these compression methods on the CASIA-HWDB test set, and the results are shown
in Table 10. Because PCA compression does not work for wcb, there are no results (denoted as 'N/A') of
wcb for 'PCA' and 'SVQ+PCA', and in iwc, only the wbi is compressed by PCA. The LMA method uses
model combination with the three selected LMs of lowest perplexity (weights estimated by MSE with
L2-norm regularization). In the table, 'Size' denotes the storage size (MB) of the whole LM set, i.e.,
16 n-gram models for character-level LMs but only 15 n-gram models for word-level LMs, which have no ancient domain
LM. The size of iwc equals the sum of those of wbi and wcb.
From Table 10, we can see that language model compression by either SVQ or PCA yields little loss of
recognition accuracy, while the storage size is reduced significantly, and the combination of SVQ and PCA
reduces the size further. As for recognition time, the decompression of SVQ takes very little time, whereas PCA
needs a little more time for searching an n-gram in a larger shared list. This demonstrates the effectiveness of
the proposed LM compression methods. Comparing the results of all LM types, the compression of cbi is
the most effective, with a size reduction of about 83.4 percent, from 134MB to 22.3MB. This is because both SVQ
and PCA only process the probability values of the LM, and this part accounts for the highest proportion of the storage in cbi
owing to its smallest n-gram table (character bi-gram) among the five LM types.
In addition, we represent the n-gram table in the trie-structure format as in our previous work [24],
which removes the repeated prefixes of n-grams and hence is lossless with respect to recognition accuracy. This
representation further compresses the storage size of each LM and speeds up the search process a little. The
results are shown in the last column (denoted as '+Trie') of Table 10, and we can see that the storage
sizes of the whole LM set are finally compressed to 14.4MB, 58.7MB, 68.7MB, 28.7MB and 97.4MB for cbi,
cti, wbi, wcb and iwc, respectively, and the average size of each cbi is compressed to 0.9MB for about 7
thousand uni-grams and 0.7 million bi-grams.
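The idea of removing repeated prefixes can be illustrated with a generic trie. The sketch below is not the storage format of [24], which is not described here; the class and function names are hypothetical, and the probabilities are stored as plain Python floats for clarity.

```python
class TrieNode:
    """One token of an n-gram; children share this node as a common prefix."""
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}
        self.value = None          # probability of the n-gram ending here, if any

def build_trie(ngram_probs):
    """Build a trie from a dict mapping n-gram tuples to probability values."""
    root = TrieNode()
    for ngram, prob in ngram_probs.items():
        node = root
        for token in ngram:
            node = node.children.setdefault(token, TrieNode())
        node.value = prob
    return root

def lookup(root, ngram):
    """Return the stored value of an n-gram, or None if it is absent."""
    node = root
    for token in ngram:
        node = node.children.get(token)
        if node is None:
            return None
    return node.value

def count_nodes(root):
    """Number of stored tokens after prefix sharing."""
    return sum(1 + count_nodes(child) for child in root.children.values())
```

For the table {("a",), ("a","b"), ("a","c"), ("b","a")}, a flat list stores 7 tokens while the trie stores 5 nodes, and every lookup returns the same value as before, which is why the representation is lossless.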
Table 10: The effects (the accuracy CR/AR (%) and Size (MB)/Time (h) given by LMA using model combination with three selected LMs) of language model compression on the CASIA-HWDB test set.
| LMs | Original CR/AR | Original Size/Time | SVQ CR/AR | SVQ Size/Time | PCA CR/AR | PCA Size/Time | SVQ+PCA CR/AR | SVQ+PCA Size/Time | +Trie Size/Time |
|---|---|---|---|---|---|---|---|---|---|
| cbi | 91.69/90.85 | 134/0.26 | 91.68/90.77 | 50.7/0.26 | 91.62/90.77 | 77.2/0.34 | 91.47/90.56 | 22.3/0.34 | 14.4/0.26 |
| cti | 92.37/91.73 | 470/0.49 | 92.35/91.70 | 178/0.49 | 92.18/91.55 | 291/0.76 | 92.12/91.48 | 102/0.74 | 58.7/0.50 |
| wbi | 92.24/91.51 | 314/1.13 | 92.20/91.47 | 147/1.13 | 92.18/91.46 | 227/1.24 | 92.10/91.34 | 102/1.20 | 68.7/1.05 |
| wcb | 91.83/91.05 | 81.6/1.26 | 91.74/90.87 | 38.7/1.31 | N/A | N/A | N/A | N/A | 28.7/1.36 |
| iwc | 92.34/91.60 | 396/1.31 | 92.30/91.51 | 186/1.30 | 92.32/91.58 | 309/1.36 | 92.24/91.44 | 141/1.41 | 97.4/1.41 |
7.5. Discussions
In this section, we discuss our LMA methods in four aspects. First, we show the effects of LMA on
different domains. Second, we evaluate the proposed LMA methods on documents with different baseline
recognition accuracies. Third, we compare various LMA methods, including the model
combination method used in speech recognition. Last, we give some recognition examples.
7.5.1. Effects of LMA on Different Domains
The above experiments show that all the LMA methods improve the recognition performance remarkably,
justifying the importance of unsupervised LMA in HCTR. Here, we further investigate the effects of
LMA on the documents of different domains in the CASIA-HWDB test set^7. The results of cti and iwc
(the other LM types give similar effects) are shown in Fig. 7, where the LMA uses the model combination
method with the three selected LMs of lowest perplexity, and the combination weights are estimated by MSE
with L2-norm regularization. We can see that the performance improvement on the ancient domain
documents (indexed as 15 in Table 3) is the largest, because the language style of ancient domain
texts is very different from that of the general domain. Table 11 shows the results of LMA for ancient domain
documents using character-level LMs (no word-level LMs of the ancient domain are available), and the cti
model improves the recognition accuracy by about 7 percentage points. The statistical test results are also shown in
Table 11, which verify the significance (sign '+' in the 's-test' column) of the performance improvement by
the LMA method.
Figure 7: Effects of LMA on each domain, indexed following Table 3 ("-LMA" denotes the results using the LMA method; the others are baseline results). (a) cti, (b) iwc (for the ancient domain, the second pass uses adaptive cti).
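Table 11: Effects of LMA for ancient domain documents.
| LMs | Baseline CR(%) | Baseline AR(%) | LMA CR(%) | LMA AR(%) | Improvement CR(%) | Improvement AR(%) | s-test |
|---|---|---|---|---|---|---|---|
| cbi | 82.26 | 81.52 | 88.33 | 87.66 | 6.07 | 6.14 | + |
| cti | 82.13 | 81.43 | 89.06 | 88.46 | 6.93 | 7.03 | + |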
7.5.2. Effects of LMA on Different Baseline Accuracies
In this part, we analyse the recognition results of each document in the CASIA-HWDB test set to investigate
the effects of LMA on different baseline recognition accuracies. Figure 8 shows the recognition results
of the cti model, where the LMA uses model combination of the three lowest-perplexity LMs with the weights
estimated by MSE with L2-norm regularization. We can see that though most documents gained improved
accuracy by LMA, there is no consistent correlation between the baseline accuracy and the effectiveness of
LMA. For a document of low baseline accuracy, if most erroneous n-grams in the first-pass recognized text
(i.e., baseline result) are meaningless for all the LMs and thus do not affect the selection of the best LM,
it gains improved accuracy (e.g., the document of index 1 in Fig. 8). On the other hand, if these erroneous
n-grams happen to belong to one domain of the pre-defined LM set, they may result in a wrong selection
and deteriorate the recognition performance (e.g., the document of index 15 in Fig. 8).
^7 In the CASIA-HWDB test set, we gave a domain label to each document, but used this information in evaluation only. There are 83 documents labeled as ancient domain, while there are no documents in the four domains of sport, house, Olympic and women.
Figure 8: Effect of LMA on documents with different baseline recognition accuracies in CASIA-HWDB test set. The indices of 1 to 39 represent the 39 documents of the minimum baseline CR (lower than 70 percent), and index 40 represents those documents with baseline CR between 70 and 75 percent (the accuracy is the average CR value of those documents), and index 41 represents those documents with baseline CR between 75 and 80 percent, and so on.
7.5.3. Comparison of Various LMA methods
This paper introduces three LMA methods, namely model selection, model combination and model
reconstruction, which are abbreviated as "MS", "MC" and "MR", respectively. In model combination, we
propose three weight estimation methods, called the heuristic method (MC-H), MSE with L2-norm regularization (MC-L2)
and MSE with L1-norm regularization (MC-L1). To compare these methods more clearly, in Table 12 we
summarize the results from Table 4 to Table 8 (cti model, three selected LMs for MC-H and MC-L2), give
the perplexity (PP) of each resultant model on the ground-truth texts of the CASIA-HWDB test set, and show
the results of statistical tests ('s-test', the same method as for Table 9) for pairwise comparisons of methods. We
also compare with the model combination method with weights learned by maximum likelihood (MC-ML)
estimation [17] and the LMA method in [25], with their results shown in Table 12.
Table 12: Summarized results of LMA using cti model on CASIA-HWDB test set.
          Base    MS      MC-H    MC-L2   MC-L1   MR      MC-ML   Ref [25]
CR (%)    90.81   91.92   92.27   92.37   92.32   92.06   92.28   91.98
AR (%)    90.21   91.37   91.71   91.73   91.64   91.49   91.71   91.43
Time (h)  0.14    0.26    0.51    0.49    0.80    0.44    0.53    0.51
PP        252     109     95      48      42      118     93      110
s-test    N/A     +       ++      +++     ++−+    +++++   ++−+−+  +−+++−+
The LMA method in [25] uses the ratio of the string probability given by each LM as the corresponding
weight. We can see that its results are similar to those of MS. This is because the string probability
ratio of the best LM is usually close to one when the string is longer than ten characters, while the other
weights approach zero. The MC-ML method is commonly used in speech recognition [17] (footnote 8). Like the
MC-H method, the MC-ML estimation is also based on the perplexity, hence their results (both recognition
accuracy and perplexity) are similar.
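The saturation of string-probability-ratio weights can be seen in a small sketch. It assumes that each LM assigns a fixed average per-character probability, so that the string probability decays geometrically with the string length; the three probability values and the function name are hypothetical.

```python
import math

def ratio_weights(log_probs):
    """Weights proportional to string probabilities, given log-probabilities.

    Computed in the log domain (subtracting the maximum) to avoid underflow.
    """
    top = max(log_probs)
    scaled = [math.exp(lp - top) for lp in log_probs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical average per-character probabilities under three LMs.
per_char_prob = [0.020, 0.012, 0.008]

for n_chars in (1, 5, 10, 20):
    log_probs = [n_chars * math.log(p) for p in per_char_prob]
    weights = ratio_weights(log_probs)
    print(n_chars, [round(w, 4) for w in weights])
```

With these made-up values the weight of the best LM already exceeds 0.99 at ten characters, so the combination degenerates to selecting a single model, which is consistent with the similarity to MS noted above.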
From Table 12, we can see that our proposed LMA methods give better performance than the baseline
system in both recognition accuracy and perplexity. LMA gives a PP reduction of 53-83 percent, which
implies that the adapted LM matches the test document much better. Among the LMA methods, the MC
method reduces PP much more than the methods of MS and MR. This is because MC improves the
generalization of each component model, and the performance of MR is limited to the shared n-gram list. Further
comparing all the MC methods, the perplexity of MC-H is much higher than those of MC-L2 and MC-L1.
The benefit of MC-L2 and MC-L1 is attributed to the fact that the objective of MSE training makes each
resultant probability closer to one, and the solution is globally optimal.
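The argument about the MSE objective can be made concrete with a rough sketch. It assumes the objective is the sum over the n-grams of the first-pass text of the squared difference between one and the combined probability, plus an L2 penalty on the weights; the exact formulation (including any constraints or normalization of the weights) is given earlier in the paper and may differ. The probabilities, the value of `lam` and the function names are hypothetical. Under this assumption the objective is a strictly convex quadratic, so the unique global minimum solves a small linear system.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def objective(probs, w, lam):
    """Sum of squared errors to the target 1, plus the L2 penalty."""
    err = sum((1.0 - sum(wk * pk for wk, pk in zip(w, row))) ** 2
              for row in probs)
    return err + lam * sum(wk * wk for wk in w)

def mse_l2_weights(probs, lam):
    """Minimize `objective` in closed form: (P^T P + lam I) w = P^T 1.

    probs[i][k] is the probability that LM k gives to the i-th n-gram.
    """
    k = len(probs[0])
    gram = [[sum(row[a] * row[b] for row in probs) + (lam if a == b else 0.0)
             for b in range(k)] for a in range(k)]
    rhs = [sum(row[a] for row in probs) for a in range(k)]
    return solve_linear(gram, rhs)

# Made-up n-gram probabilities under three selected LMs (rows: n-grams).
probs = [
    [0.30, 0.05, 0.01],
    [0.20, 0.25, 0.02],
    [0.02, 0.40, 0.05],
    [0.10, 0.02, 0.35],
]
lam = 0.1  # hypothetical regularization strength
weights = mse_l2_weights(probs, lam)
print([round(w, 3) for w in weights], round(objective(probs, weights, lam), 4))
```

The weights returned here are not constrained to sum to one; whether and how the combination weights are normalized is not stated in this section, so this should be read as an illustration of the convexity argument only.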
The comparison of recognition accuracies leads to a conclusion similar to that of perplexities, though
the improvement of accuracy is not as large as the reduction of perplexity. To verify the reliability of
the performance improvement, each sign sequence in the last row of Table 12 gives the statistical test results
for comparisons of that method with the methods in the columns to its left, in turn ('+': significant and '−':
non-significant at the 0.05 significance level). We can see that all improvements by the LMA methods compared
to the baseline system are significant, and the MC-L2 method significantly outperforms all other methods,
including MC-ML (its fourth sign is '+' and its correct rate (CR) is lower than that of MC-L2).
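The reading of the s-test row and the quoted PP reduction can be checked against the values of Table 12 with a short sketch. It assumes the layout described above, in which the method in column j carries one sign for each method to its left; the variable and function names are hypothetical, and '-' stands for the minus sign.

```python
METHODS = ["Base", "MS", "MC-H", "MC-L2", "MC-L1", "MR", "MC-ML", "Ref [25]"]
PP = [252, 109, 95, 48, 42, 118, 93, 110]
# Signs of the s-test row of Table 12, concatenated from left to right.
S_TEST = "+" "++" "+++" "++-+" "+++++" "++-+-+" "+-+++-+"

def pp_reduction(base_pp, pp):
    """Relative perplexity reduction with respect to the baseline, in percent."""
    return 100.0 * (base_pp - pp) / base_pp

def split_signs(signs, methods):
    """Assign to the j-th method (j >= 1) the next j signs, one per method to its left."""
    result = {}
    pos = 0
    for j in range(1, len(methods)):
        chunk = signs[pos:pos + j]
        result[methods[j]] = dict(zip(methods[:j], chunk))
        pos += j
    if pos != len(signs):
        raise ValueError("sign count does not match a triangular layout")
    return result

reductions = {m: pp_reduction(PP[0], p) for m, p in zip(METHODS[1:], PP[1:])}
tests = split_signs(S_TEST, METHODS)

for method in METHODS[1:]:
    signs = "".join(tests[method].values())
    print(f"{method:9s} PP reduction {reductions[method]:5.1f}%  s-test {signs}")

# The fourth sign of MC-ML is its comparison with MC-L2.
print("MC-ML vs MC-L2:", tests["MC-ML"]["MC-L2"])
```

Under this layout the smallest and largest reductions come from MR and MC-L1, and every method's first sign (its comparison with the baseline) is '+'.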
8 It is difficult to use MAP estimation here because there is no prior information about the dynamically selected LMs.
7.5.4. Recognition Examples
According to the error analysis in our previous paper [4], this paper focuses on correcting path search
failures by giving an adaptive, more accurate language model in the path evaluation function. Figure 9 shows
three examples of strings, where the first two samples (Fig. 9(a) and 9(b)) were also given in our previous
paper [4] as error examples. Among these three examples, the second one (Fig. 9(b)) is corrected
by the proposed LMA method (cti model with model combination adaptation), demonstrating its benefits in
HCTR. The first one (Fig. 9(a)) is not corrected because the true character class is outside the candidate character
class set. In the third example (Fig. 9(c)), the candidate set includes the true character class, but the character is finally
recognized wrongly due to the limitation of the n-gram model, which only considers short-distance context.
Figure 9: Three recognition examples. (a) The error is not corrected due to the imperfect candidate character class set; (b) the error is corrected by LMA; (c) the error is not corrected due to the short-distance context limitation of the n-gram model (it ignores the semantic correlation of the two characters indicated by the arrows; here both mean 'let' and the green one is recognized correctly). In each example, line 1 is the character string image, line 2 is the result of the baseline system, line 3 is the result of LMA, and line 4 is the ground truth.
8. Conclusion
This paper presents an unsupervised language model adaptation (LMA) framework for handwritten
Chinese text recognition. Based on a two-pass recognition strategy, we propose three methods to dynamically
generate an adaptive language model (LM) to match the test document via a pre-defined multi-domain LM
set, namely, model selection, model combination and model reconstruction. The experimental results show
that, considering the tradeoff between recognition accuracy and speed, the model combination of three
selected LMs performs best when the combination weights are learned by minimizing the sum of squared
error (MSE) with L2-norm regularization. The results on both the CASIA-HWDB test set and the HIT-MW test
set (very short texts) justify the benefits of the proposed unsupervised LMA method, especially for ancient
domain documents, whose recognition accuracy is improved by 7 percent. Aiming for practical applications,
we compress all LMs substantially with little loss of accuracy using split vector quantization (SVQ) and principal
component analysis (PCA). The proposed methods are general enough to be applicable to
recognizing documents in other languages (such as English and Arabic) as well.
The analysis of recognition errors indicates that further research efforts are needed to improve character
classification and language modeling. Better character classifiers can improve the tradeoff between the
number of candidate classes and the probability of including the true class. The path evaluation criterion
can be improved by using better language models that fuse as many of the syntactic, semantic,
and pragmatic characteristics as possible for the recognition task, i.e., by considering long-distance context.
References
[1] R.-W. Dai, C.-L. Liu, B.-H. Xiao, Chinese Character Recognition: History, Status and Prospects, Frontiers of Computer Science
in China 1 (2) (2007) 126-136.
[2] H. Fujisawa, Forty Years of Research in Character and Document Recognition—An Industrial Perspective, Pattern Recognition
41 (8) (2008) 2435-2446.
[3] T.-H. Su, T.-W. Zhang, D.-J. Guan, H.-J. Huang, Off-Line Recognition of Realistic Chinese Handwriting Using Segmentation-
Free Strategy, Pattern Recognition 42 (1) (2009) 167-182.
[4] Q.-F. Wang, F. Yin, C.-L. Liu, Handwritten Chinese Text Recognition by Integrating Multiple Contexts, IEEE Trans. Pattern Anal.
Mach. Intell. 34 (8) (2012) 1469-1481.
[5] J.R. Bellegarda, Statistical Language Model Adaptation: Review and Perspectives, Speech Communication 42 (1) (2004) 93-108.
[6] A. Sethy, P.G. Georgiou, B. Ramabhadran, S. Narayanan, An Iterative Relative Entropy Minimization-Based Data Selection
Approach for n-Gram Model Adaptation, IEEE Trans. Audio, Speech, Language Processing 17 (1) (2009) 13-23.
[7] A. Stolcke, SRILM—an extensible language modeling toolkit, Proc. 7th ICSLP, 2002, pp. 901-904.
[8] J.R. Bellegarda, Exploiting Latent Semantic Information in Statistical Language Modeling, Proceedings of the IEEE 88 (8) (2000)
1279-1296.
[9] D. Mrva, P.C. Woodland, Unsupervised Language Model Adaptation for Mandarin Broadcast Conversation Transcription, Proc.
7th Interspeech, 2006, pp. 2206-2209.
[10] Y.-C. Tam, T. Schultz, Correlated Latent Semantic Model for Unsupervised LM Adaptation, Proc. 32nd ICASSP, 2007, pp. 41-44.
[11] M. Bacchiani, B. Roark, Unsupervised Language Model Adaptation, Proc. 28th ICASSP, 2003, pp. 224-227.
[12] G. Tur, A. Stolcke, Unsupervised Language Model Adaptation for Meeting Recognition, Proc. 32nd ICASSP, 2007, pp. 173-176.
[13] P. Xiu, H.S. Baird, Whole-Book Recognition, IEEE Trans. Pattern Anal. Mach. Intell. 34 (12) (2012) 2467-2480.
[14] D.-S. Lee, R. Smith, Improving Book OCR by Adaptive Language and Image Models, Proc. 10th Int. Workshop on Document
Analysis Systems, 2012, pp. 115-119.
[15] P.C. Woodland, T. Hain, G.L. Moore, T.R. Niesler, D. Povey, A. Tuerk, E.W.D. Whittaker, The 1998 HTK Broadcast News
Transcription System: Development and Results, Proc. DARPA Broadcast News Workshop, 1999, pp. 265-270.
[16] C. Allauzen, M. Riley, Bayesian Language Model Interpolation for Mobile Speech Input, Proc. 12th Interspeech, 2011,
pp. 1429-1432.
[17] X. Liu, M.J.F. Gales, P.C. Woodland, Use of Contexts in Language Model Interpolation and Adaptation, Computer Speech and
Language 27 (2013) 301-321.
[18] A. Stolcke, Entropy-Based Pruning of Backoff Language Models, Proc. DARPA Broadcast News Workshop, 1998, pp. 270-274.
[19] E.W.D. Whittaker, B. Raj, Quantization-based Language Model Compression, Proc. 7th Eurospeech, 2001, pp. 33-36.
[20] C.-L. Liu, F. Yin, D.-H. Wang, Q.-F. Wang, CASIA Online and Offline Chinese Handwriting Databases, Proc. 11th ICDAR,
2011, pp. 37-41.
[21] T.-H. Su, T.-W. Zhang, D.-J. Guan, Corpus-Based HIT-MW Database for Offline Recognition of General-Purpose Chinese
Handwritten Text, Int. J. Document Analysis and Recognition 10 (1) (2007) 27-38.
[22] L. Sirovich, M. Kirby, Low-dimensional procedure for the characterization of human faces, J. Optical Society of America A 4
(3) (1987) 519-524.
[23] T.F. Cootes, C.J. Taylor, D.H. Cooper, J. Graham, Active Shape Models-Their Training and Application, Computer Vision and
Image Understanding 61 (1) (1995) 38-59.
[24] Q.-F. Wang, F. Yin, C.-L. Liu, Improving Handwritten Chinese Text Recognition by Unsupervised Language Model Adaptation,
Proc. 10th Int. Workshop on Document Analysis Systems, 2012, pp. 110-114.
[25] Q. He, S. Chen, M. Zhao, W. Lin, A Hybrid Language Model for Handwritten Chinese Sentence Recognition, Proc. 13th ICFHR,
2012, pp. 129-134.
[26] J.F. Gao, H. Suzuki, W. Yuan, An Empirical Study on Language Model Adaptation, ACM Trans. Asian Language Information
Processing 5 (3) (2005) 209-227.
[27] F.-F. Liu, Y. Liu, Unsupervised Language Model Adaptation Incorporating Named Entity Information, Proc. 45th ACL, 2007,
pp. 672-679.
[28] D. Huggins-Daines, A.I. Rudnicky, Implicitly Supervised Language Model Adaptation for Meeting Transcription, Proc. NAACL-HLT,
2007, pp. 73-76.
[29] L. Chen, J.-L. Gauvain, L. Lamel, G. Adda, Dynamic Language Modeling for Broadcast News, Proc. 8th ICSLP, 2004, pp.
1281-1284.
[30] W. Wang, A. Stolcke, Integrating MAP, Marginals, and Unsupervised Language Model Adaptation, Proc. 8th Interspeech, 2007,
pp. 618-621.
[31] S.J. Pan, Q. Yang, A Survey on Transfer Learning, IEEE Trans. Knowledge and Data Engineering 22 (10) (2010) 1345-1359.
[32] J.F. Gao, M. Zhang, Improving Language Model Size Reduction using Better Pruning Criteria, Proc. 40th ACL, 2002, pp.
176-182.
[33] B.-J. Hsu, J. Glass, Iterative Language Model Estimation: Efficient Data Structure & Algorithms, Proc. 9th Interspeech, 2008,
pp. 1-4.
[34] A. Pauls, D. Klein, Faster and smaller n-gram language models, Proc. 49th ACL-HLT, 2011, pp. 258-267.
[35] W. Law, C.F. Chen, Split-dimension vector quantization of Parcor coefficients for low bit rate speech coding, IEEE Trans. Speech
Audio Process. 2 (3) (1994) 443-446.
[36] T. Long, L.W. Jin, Building compact MQDF classifier for large character set recognition by subspace distribution sharing, Pattern
Recognition 41 (2008) 2916-2925.
[37] R. Rosenfeld, Two Decades of Statistical Language Modeling: Where Do We Go from Here? Proceedings of the IEEE 88 (8)
(2000) 1270-1278.
[38] S.F. Chen, J. Goodman, An Empirical Study of Smoothing Techniques for Language Modeling, Computer Speech and Language
13 (1999) 359-394.
[39] A. Vinciarelli, S. Bengio, H. Bunke, Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical
Language Models, IEEE Trans. Pattern Anal. Mach. Intell. 26 (6) (2004) 709-720.
[40] Y.X. Li, C.L. Tan, X.Q. Ding, A Hybrid Post-Processing System for Offline Handwritten Chinese Script Recognition, Pattern
Analysis and Applications 8 (3) (2005) 272-286.
[41] D. Jurafsky, J.H. Martin, Speech and Language Processing, 2nd ed. Pearson Prentice Hall, 2008.
[42] Q.-F. Wang, F. Yin, C.-L. Liu, Integrating Language Model in Handwritten Chinese Text Recognition, Proc. 10th ICDAR, 2009,
pp. 1036-1040.
[43] J. Friedman, T. Hastie, H. Hofling, R. Tibshirani, Pathwise Coordinate Optimization, The Annals of Applied Statistics 1 (2)
(2007) 302-332.
[44] F. Kimura, K. Takashina, S. Tsuruoka, Y. Miyake, Modified Quadratic Discriminant Functions and The Application to Chinese
Character Recognition, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1) (1987) 149-153.
[45] C.-L. Liu, Normalization-Cooperated Gradient Feature Extraction for Handwritten Character Recognition, IEEE Trans. Pattern
Anal. Mach. Intell. 29 (8) (2007) 1465-1469.
[46] S. Martin, J. Liermann, H. Ney, Algorithms for Bigram and Trigram Word Clustering, Speech Communication 24 (1998) 19-37.
[47] J. Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research 7 (2006) 1-30.