
Pattern Recognition 42 (2009) 3146 -- 3157


Handwritten Chinese text line segmentation by clustering with distance metric learning

Fei Yin∗, Cheng-Lin Liu

National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun East Road, Beijing 100190, PR China

Article history: Received 7 August 2008; received in revised form 21 November 2008; accepted 20 December 2008

Keywords: Handwritten text line segmentation; Clustering; Minimal spanning tree (MST); Distance metric learning; Hypervolume reduction

Abstract

Separating text lines in unconstrained handwritten documents remains a challenge because handwritten text lines are often non-uniformly skewed and curved, and the space between lines is not obvious. In this paper, we propose a novel text line segmentation algorithm based on minimal spanning tree (MST) clustering with distance metric learning. Given a distance metric, the connected components (CCs) of the document image are grouped into a tree structure, from which text lines are extracted by dynamically cutting the edges using a new hypervolume reduction criterion and a straightness measure. By learning the distance metric in a supervised manner on a dataset of pairs of CCs, the proposed algorithm is made robust enough to handle various documents with multi-skewed and curved text lines. In experiments on a database of 803 unconstrained handwritten Chinese document images containing a total of 8,169 lines, the proposed algorithm achieved a correct line detection rate of 98.02% and compared favorably to other competitive algorithms.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Text line segmentation from document images is one of the major problems in document image analysis. It provides crucial information for the tasks of text block segmentation, character segmentation and recognition, and text string recognition. Whereas the difficulty of machine-printed document analysis mainly lies in the complex layout structure and degraded image quality, handwritten document analysis is difficult mainly due to the irregularity of layout and character shapes originating from the variability of writing styles. For unconstrained handwritten documents, text line segmentation and character segmentation-recognition remain unsolved, although enormous efforts have been devoted to them and great advances have been made.

Text line segmentation of handwritten documents is much more difficult than that of printed documents. Whereas printed documents have approximately straight and parallel text lines, the lines in handwritten documents are often non-uniformly skewed and curved. Moreover, the spaces between handwritten text lines are often not obvious compared to the spaces between within-line characters, and some text lines may interfere with each other. Therefore, many text line detection techniques, such as projection analysis [1–7] and K-nearest neighbor grouping of connected components (CCs) [12–14], are not able to segment handwritten text lines successfully. Fig. 1 shows an example of an unconstrained handwritten Chinese document with segmentation results by the X–Y cut algorithm [1], the stroke skew correction algorithm [6], the Docstrum algorithm [12] and the piece-wise projection algorithm [5]. In this case, we can see that the X–Y cut algorithm and the stroke skew correction algorithm succeed in detecting the text lines, but fail to locate the boundaries of text lines. The Docstrum algorithm can locate the boundaries of text lines very well, but fails to detect some lines (the first and fourth lines in Fig. 1(c)) correctly because of the anomalous size of characters. Although the piece-wise projection algorithm can overcome the aforementioned errors, it fails to segment some small-size CCs (the first and eighth lines in Fig. 1(d)).

∗ Corresponding author. Tel.: +861062632251. E-mail addresses: [email protected] (F. Yin), [email protected] (C.-L. Liu).

0031-3203/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2008.12.013

Many efforts have been devoted to the difficult problem of handwritten text line segmentation [1–28]. The methods can be roughly categorized into three classes: top-down, bottom-up, and hybrid. Top-down methods partition the document image recursively into text regions, text lines, and words/characters with the assumption of straight lines. Bottom-up methods group small units of image (pixels, CCs, characters, words, etc.) into text lines and then text regions. Bottom-up grouping can be viewed as a clustering process, which aggregates image components according to proximity and does not rely on the assumption of straight lines. Hybrid methods combine bottom-up grouping and top-down partitioning in different ways.


Fig. 1. An example of a handwritten document with text lines segmented by the X–Y cut algorithm (a), the stroke skew correction algorithm (b), the Docstrum algorithm (c) and the piece-wise projection algorithm (d).

All three approaches have their disadvantages. Top-down methods do not perform well on curved and overlapping text lines. The performance of bottom-up grouping relies on heuristic rules or artificial parameters, such as the between-component distance metric for clustering. Hybrid methods, on the other hand, are computationally complicated, and the design of a robust combination scheme is non-trivial.

In this paper, we propose an effective bottom-up method for text line segmentation in unconstrained handwritten Chinese documents. Our approach is based on minimal spanning tree (MST) clustering of CCs, and the distance metric between CCs is designed by supervised learning. The number of clusters, namely the number of text lines, is automatically decided by a new hypervolume reduction criterion. Except for some empirical parameters in pre-processing of CCs and in post-processing of text lines, the clustering algorithm itself has no artificial parameters. The experimental comparison of clustering with metric learning against clustering with an artificially designed metric shows that supervised metric learning largely improves the accuracy of text line segmentation. The proposed method was also compared with other state-of-the-art methods in experiments on a large database of handwritten Chinese documents, and its superiority was demonstrated. By customizing the between-component features and training with documents of specific languages, we suggest that the proposed method is also applicable to documents of other languages.

The rest of this paper is organized as follows. In Section 2, we give a brief review of the related works. An overall description of our clustering-based text line segmentation method is given in Section 3, and the distance metric learning scheme is elaborated in Section 4. In Section 5, we present the hypervolume reduction criterion and the straightness measure for text line grouping. Experimental results are presented in Section 6 and concluding remarks are given in Section 7.

2. Previous works

The structure of a document image is a hierarchy of text regions, text lines, words, characters and CCs. Text lines can be extracted by either top-down region partitioning, bottom-up components aggregation, or a hybrid scheme. Some representative segmentation methods are reviewed below.

The X–Y cut algorithm [1,2] is a typical projection-based top-down segmentation method. It uses horizontal and vertical projection histograms alternately along the X and Y axis so as to partition the document image into a hierarchical tree structure in which each leaf node represents a text line. Because of the assumption of parallel text lines and significant between-line gaps, this method performs well only on printed documents. Some modified projection-based methods have been proposed to deal with slightly curved handwritten text lines. The piece-wise projection approaches partition the document image into several vertical strips [3–5]. The text lines, assumed to be approximately straight in a strip, are extracted from each strip according to horizontal projection profiles and then connected using heuristic rules. Su et al. [6] use the horizontal stroke histogram to detect the skew of handwritten Chinese documents and segment the text lines with the projection histogram along the estimated skew angle. Weliwitage et al. [7] describe a modified projection-based method called cut text minimization (CTM), in which an optimization technique is applied to minimize the text pixels cut while tracking the boundary between two lines after the start points of lines are found using projection. Similarly, Liwicki et al. [8] propose another improved projection-based method combined with slant detection, in which they use dynamic programming to find the paths between two consecutive lines. From a different viewpoint, some researchers proposed water reservoir based top-down methods [9–11]. They assume that hypothetical water flows, from both left and right sides of the image frame, face obstruction from characters of text lines, and the strips of areas left un-wetted on the image frame are labeled for extracting text lines. An obvious observation about most top-down methods is that their performance relies on the assumption of well-separable text lines: approximately straight and parallel globally or locally in a region.

The Docstrum method of O'Gorman [12] is typical of bottom-up grouping. It merges neighboring CCs using rules based on the geometric relationship between K nearest neighbor units, and performs well on printed documents as well as handwritten documents with slightly curved lines. Under similar ideas, the Voronoi diagram combined with heuristic rules in [14] is used to merge CCs into text lines. In [13], each CC is represented by the vertical coordinates of its bounding box, and the CCs are grouped by weighted k-means clustering under the spatial constraints of valid address lines. Likforman-Sulem and Faure [15] develop a method based on perceptual grouping, in which text lines are iteratively constructed by grouping neighboring CCs according to three Gestalt criteria, namely, proximity, similarity and direction continuity. Although this method can integrate the local constraints with a global measure, it cannot be applied to poorly structured documents. Nicola et al. [16] use the artificial intelligence concept of a production system to search for an optimal alignment of CCs into text lines. Defining the initial state as a set of CCs in the un-segmented document and a possible alignment (text lines) of the CCs as the goal state, they give two operators (“merge” or “do not merge” a component to its adjacent text lines) for traversing states under a best path search framework. The reliance of this method on heuristic rules makes it inefficient for unconstrained handwritten documents. The grouping of components into text lines has been treated using MST clustering [17,18], in which the CCs are grouped by MST with a hand-crafted distance metric and then the edges between text lines are deleted using heuristic rules; the performance relies on the distance metric between components and on the heuristic rules. The Hough transform has also been applied to handwritten text line detection with the gravity centers [19,20] or minima points [21] of CCs as the points to be fitted. Sometimes, the CCs are split into equally spaced blocks to be voted in the Hough domain [22]. In general, Hough transform based methods need a sophisticated post-processing procedure to extract the lines and involve a high computational burden.

From a different viewpoint, some researchers proposed smearing-based bottom-up methods. Shi et al. use an adaptive local connectivity map (ALCM) [23] or a fuzzy run-length [24], in which the value of each pixel is the sum of all pixels in the original image within a specified horizontal distance. After thresholding the smeared image, the CCs represent probable regions of text lines. Kennard and Barrett use a similar method with a slight extension to deal with free-form handwritten historical documents [25]. All the smearing-based methods, like other bottom-up ones, involve parameters to be tuned artificially. Nicolas et al. treat document image segmentation as a labeling problem [26]. They partition the document image into an n×m grid and construct a Markov random field (MRF) model based on the grid, and then label the grid pixels into some states. Their results show that this method does not perform robustly on handwritten documents.

The level set based method proposed by Li et al. [27] is an effective hybrid approach for unconstrained handwritten documents. After converting a binary image to gray scale using a continuous anisotropic Gaussian kernel, the level set method is exploited to determine the boundary between neighboring text lines. Although high accuracies of text line segmentation were reported, this method suffers from high computational complexity.

3. Clustering based text line segmentation

In this section, we describe the rationale of our approach and the MST clustering algorithm. The distance metric learning and text line grouping techniques are elaborated in Sections 4 and 5, respectively. The performance of MST clustering relies on the metric of distance between image components. After clustering, the resulting tree is carefully cut into subtrees each corresponding to a text line.

Fig. 2. Hierarchical structure of a document image.

3.1. Rationale

A document image can be viewed as a hierarchical structure as in Fig. 2: it consists of text lines, each text line consists of CCs, and a CC is made of black runs or pixels. Equivalently, a text line can be viewed as a cluster of stroke pixels or CCs. We prefer using CCs as the basic units of clustering because the CCs are easy to detect and the number of CCs is much smaller than that of stroke pixels. Obviously, an important feature of this clustering problem is that all clusters (text lines) have irregular boundaries. We use the MST algorithm for clustering the CCs into text lines because it does not assume a spherical shaped clustering structure of the underlying data as many other clustering algorithms do. Two important issues in clustering are the distance metric between units and the criterion for determining the number of clusters.

A good metric for clustering CCs should meet the condition that the distance between two neighboring components in the same text line is smaller than that between components of different lines. In documents with close or interfering text lines, the Euclidean distance does not satisfy this condition. We previously used a hand-crafted distance metric [28], which works fairly well but not well enough. We hereby design a better metric by supervised learning on labeled pairs of CCs. By labeling some pairs as “close” (within the same text line) and some others as “distant” (between lines), a distance metric can be automatically learned to fit the target of small distance within a text line and large distance between lines.

Under a learned distance metric, the tree generated by the MST algorithm has the desired characteristic that the neighboring CCs of the same text line are connected and each line corresponds to a subtree (Fig. 3). However, the branches (paths between terminal and branching nodes) do not correspond to text lines perfectly due to the variability of layout of text lines. We hence use a second-stage clustering procedure to dynamically cut the edges of the tree into groups corresponding to text lines.

Fig. 3. MST of a document image.

The criterion to select the edge to cut and the criterion to stop cutting (to determine the number of clusters) are important in the second stage. Simply deleting the longest edge is not promising because the edges between different lines (red lines in Fig. 3) are not always longer than those within the same line. Our approach is to select the edge to cut such that the sum of hypervolumes of clusters is reduced maximally, and to stop clustering when the measure of straightness of text lines reaches a maximum.

Fig. 4. The framework of our approach.

From the above description, the framework of our approach can be depicted as in Fig. 4.

3.2. Clustering algorithm

Our algorithm starts with a binary document image. In pre-processing, the CCs are labeled using a fast algorithm based on contour tracing [29]. Small components with few black pixels are considered noise and are removed. We then estimate the dominant character size from the component-size histogram obtained using the method in [12]. Empirically, components whose height or width exceeds three times the dominant character height are split vertically or horizontally using the touching character splitting method in [30], because they most likely contain touching characters and such big components affect the result of MST clustering. Finally, each component is viewed as a node in a graph (the document graph). Each pair of nodes is linked by an edge with the distance between them as the weight. The metric of distance is designed to strengthen within-line links and weaken between-line links. From the weighted document graph, an MST is built using Kruskal's algorithm [31]. In the resulting tree, most edges correspond to within-line links and some correspond to between-line links.

Since Kruskal's MST algorithm is well known and can be easily found in the literature, we do not give its details in this paper.
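As an illustration only, the construction of the document graph and its MST can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names `kruskal_mst` and `document_graph` are hypothetical, components are reduced to their centroids, and plain Euclidean distance stands in for the learned metric of Section 4.

```python
from itertools import combinations
from math import hypot

def kruskal_mst(n, edges):
    """Return the MST of a graph with n nodes as a list of (u, v, weight).

    edges: iterable of (weight, u, v) tuples.
    """
    parent = list(range(n))

    def find(a):
        # Locate the representative of a's set (with path halving).
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two different subtrees, so keep it
            parent[ru] = rv
            tree.append((u, v, w))
            if len(tree) == n - 1:
                break
    return tree

def document_graph(centroids, dist):
    """Fully connected graph over the components, weighted by dist."""
    return [(dist(centroids[i], centroids[j]), i, j)
            for i, j in combinations(range(len(centroids)), 2)]

def euclid(p, q):
    # Assumption: Euclidean distance replaces the learned metric here.
    return hypot(p[0] - q[0], p[1] - q[1])

# Toy layout: components 0-2 lie on one line, components 3-5 on another.
centroids = [(0, 0), (10, 0), (20, 1), (0, 30), (10, 31), (20, 30)]
mst = kruskal_mst(len(centroids), document_graph(centroids, euclid))
between = [e for e in mst if (e[0] < 3) != (e[1] < 3)]
```

In this toy layout the tree contains four within-line edges and a single between-line edge, which is the kind of edge the grouping stage of Section 5 has to find and cut.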

4. Distance metric learning

As many clustering algorithms rely critically on the distance metric between pairs of input units, some recent studies have contributed to metric learning from data [32–34]. For improving the performance of fuzzy c-means clustering, an evolutionary algorithm was used to optimize the scales of the dimensions of the input data set [32]. Domeniconi [33] proposed a variant of the k-means algorithm in which individual Euclidean metric weights were learned for each cluster. Xing et al. [34] combined gradient descent and iterative projections to learn a Mahalanobis metric for k-means clustering. Inspired by these works on distance metric learning, we herein design our distance metric for text line segmentation by supervised learning.

In our work, the definition of the distance between CCs is the key to making the generated MST have the components of the same text line in the same subtree and those of different lines in different subtrees. In Fig. 5, we give an example of MST clustering based on a hand-crafted metric [28] and one based on the learned metric proposed in this paper. In the figure, we mark the between-line edges with blue lines. We can observe that while the learned metric groups the CCs of the same text line in the same subtree (Fig. 5(b)), the hand-crafted metric splits some text lines into multiple subtrees (Fig. 5(a)).

4.1. Problem formulation

For supervised learning of the distance metric between CCs, we need some training samples of component pairs labeled as “within-line” and “between-line”. To do this, we annotated some training document images using our ground-truthing tool GTLC (Ground-truthing tool for handwritten Chinese Text Lines and Characters) [42], which labels text lines and characters by automated transcript alignment and hand correction.

Let $C = \{x_1, x_2, \ldots, x_n\}$ be a collection of CCs in a training document, where $n$ is the number of components. We obtain two sets of component pairs as the samples for metric learning:

$$S = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ belong to the same line}\},$$

$$D = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ belong to different lines}\}.$$

Considering the fact that only spatially neighboring components are linked in the MST, we can discard many component pairs from the sample set to accelerate metric learning. To do this, we construct the area Voronoi diagram [14] of the training document, which represents the spatial adjacency between the components. A component $x_i$ is a neighbor of another component $x_j$ only if they are adjacent in the Voronoi diagram. The pairs that are not adjacent are removed from $S$ and $D$.
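The construction of the two sample sets can be sketched as below. This is only an illustration under stated assumptions: the adjacency list is taken as given (computing the area Voronoi diagram is outside the sketch), and the names `build_pair_sets`, `line_of` and `adjacent` are hypothetical.

```python
def build_pair_sets(line_of, adjacent_pairs):
    """Split adjacent component pairs into S (same line) and D (different lines).

    line_of: dict mapping a component id to its ground-truth line id.
    adjacent_pairs: iterable of (i, j) pairs that are neighbors in the
        area Voronoi diagram (assumed to be computed elsewhere).
    """
    S, D = set(), set()
    for i, j in adjacent_pairs:
        pair = (min(i, j), max(i, j))  # store each unordered pair once
        if line_of[i] == line_of[j]:
            S.add(pair)
        else:
            D.add(pair)
    return S, D

# Toy ground truth: components 0 and 1 on line "a", 2 and 3 on line "b".
line_of = {0: "a", 1: "a", 2: "b", 3: "b"}
adjacent = [(0, 1), (1, 2), (3, 2), (0, 2)]
S, D = build_pair_sets(line_of, adjacent)
```

Pairs that never appear in the adjacency list, such as (0, 3) here, are simply never added, which corresponds to removing non-adjacent pairs from both sets.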

Fig. 5. The results of MST clustering with hand-crafted metric (a) and learned metric (b).

The aim of metric learning is to make the distance between components in $S$ small and the distance between components in $D$ large under the learned metric. Hence, we formulate the problem of metric learning as a convex programming problem [35]:

$$\min_{A \in \mathbb{R}^{m \times m}} \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2$$

$$\text{s.t.} \quad A \succeq 0, \quad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A^2 \geq 1,$$

where $A \in \mathbb{R}^{m \times m}$ ($m$ is the dimensionality of the feature space characterizing component pairs) defines the distance metric:

$$d(x_i, x_j) = d_A(x_i, x_j) = \|x_i - x_j\|_A = \sqrt{v_{ij}^{T} A \, v_{ij}},$$

and $v_{ij}$ is the feature vector characterizing the relationship between components $x_i$ and $x_j$. $A$ is determined by solving the convex programming problem.
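To make the role of the matrix A concrete, the distance between a pair of components with feature vector v can be evaluated as in the sketch below. The matrices used are hypothetical values chosen for illustration, not learned ones, and the function name `metric_distance` is an assumption.

```python
from math import sqrt

def metric_distance(v, A):
    """Return sqrt(v^T A v) for a feature vector v and a matrix A (list of rows)."""
    m = len(v)
    q = sum(v[r] * A[r][c] * v[c] for r in range(m) for c in range(m))
    return sqrt(max(q, 0.0))  # clamp tiny negative values caused by round-off

# With A equal to the identity, the metric reduces to the Euclidean norm of v.
identity = [[1.0, 0.0], [0.0, 1.0]]
# A hypothetical positive semi-definite A that weights the second feature more.
weighted = [[1.0, 0.0], [0.0, 9.0]]

v = [3.0, 4.0]
d_euclid = metric_distance(v, identity)
d_weighted = metric_distance(v, weighted)
```

A larger entry in A stretches the corresponding feature, so pairs that differ mainly in that feature end up farther apart under the metric.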

4.2. Feature space

The features characterizing the relationship between two components $x_i$ and $x_j$ are integral to the distance metric and influence the performance of clustering. Below is a list of the features (eight in total) that we use.

(1) Normalized horizontal and vertical distances between the centroids of two components.

The horizontal/vertical distance between the centroids of two CCs measures the spatial closeness. For generalizing to different documents (with differing font size and imaging resolution), this distance should be normalized with respect to the character size (divided by the estimated dominant character size).

(2) Normalized horizontal and vertical overlapping degree.

If two components overlap horizontally (align vertically), the normalized horizontal overlap degree can be computed by [30]:

$$\mathrm{novlp}_x = \frac{1}{2}\left(\frac{\mathrm{ovlp}}{W_1} + \frac{\mathrm{ovlp}}{W_2}\right) - \frac{\mathrm{dist}}{\mathrm{span}},$$

where ovlp is the overlapping width of two bounding boxes, $W_1$ and $W_2$ are the widths of the bounding boxes, dist is the horizontal distance between the centers of two bounding boxes, and span is the spanning width of two bounding boxes (Fig. 6).

The normalized vertical overlap degree is computed similarly from the heights of two bounding boxes.

(3) Normalized horizontal and vertical minimum run-length.

Fig. 6. Definition of normalized horizontal overlap.

Fig. 7. An example of minimum run-length (MRL).

The horizontal minimum run-length ($\mathrm{MRL}_x$) is the horizontal run-length between vertically overlapping (horizontally aligned) CCs, wherein the minimum horizontal distance between black runs is taken as the distance measure (Fig. 7). It is similarly normalized with respect to the dominant character size of the document image.

The vertical minimum run-length ($\mathrm{MRL}_y$) is computed similarly and normalized with respect to the dominant character size.

(4) Height and width ratio of merged components.

Suppose two CCs are merged; then the Height Ratio is computed by:

$$R_{hei} = \frac{\max(H_1, H_2)}{\mathrm{span}},$$

where $H_1$ and $H_2$ are the heights of the bounding boxes, and span is the spanning height of two bounding boxes.

The Width Ratio is computed similarly from the widths of the two CCs.
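Two of the bounding-box features above can be sketched as follows. The box layout `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this illustration. The overlap formula is applied as written and presumes that the two boxes overlap horizontally; how non-overlapping boxes are treated is not specified in the text.

```python
def normalized_horizontal_overlap(b1, b2):
    """novlp_x for two bounding boxes given as (x_min, y_min, x_max, y_max)."""
    x0a, _, x1a, _ = b1
    x0b, _, x1b, _ = b2
    ovlp = min(x1a, x1b) - max(x0a, x0b)           # overlapping width
    w1, w2 = x1a - x0a, x1b - x0b                  # widths W1 and W2
    dist = abs((x0a + x1a) / 2 - (x0b + x1b) / 2)  # distance between box centers
    span = max(x1a, x1b) - min(x0a, x0b)           # spanning width
    return 0.5 * (ovlp / w1 + ovlp / w2) - dist / span

def height_ratio(b1, b2):
    """R_hei: the taller box height divided by the spanning height."""
    _, y0a, _, y1a = b1
    _, y0b, _, y1b = b2
    span = max(y1a, y1b) - min(y0a, y0b)
    return max(y1a - y0a, y1b - y0b) / span

box_a = (0, 0, 10, 10)
box_b = (5, 2, 15, 22)
novlp_x = normalized_horizontal_overlap(box_a, box_b)
r_hei = height_ratio(box_a, box_b)
```

For two identical boxes the overlap degree is 1 and the height ratio is 1; both values shrink as the boxes drift apart or differ in extent.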

5. Text line grouping

Although the learned distance metric encourages the components in the same text line to be connected in a subtree, some components from different lines are still connected. Since the between-line edges are not necessarily longer than the within-line edges (edge length being the distance between components), correctly recognizing and cutting the between-line edges is non-trivial. Although several algorithms [36–39] have been proposed for this problem, they do not perform satisfactorily in our case of handwritten Chinese text line segmentation. The cutting results for the image in Fig. 3 using the algorithms of [37,39] are shown in Fig. 8, where many cutting or connection errors occurred; these are marked with blue circles.

Our MST-based text line grouping process consists of two phases: in initial grouping, the MST is cut into subtrees using a hypervolume reduction criterion and a straightness measure, and in post-processing, some remaining text line errors are corrected using heuristic rules.

5.1. Initial grouping

Fig. 8. Results of edge cutting for the image in Fig. 3 using the algorithm in [37] (a) and the algorithm in [39] (b). Cutting/connection errors are marked with blue circles. (For interpretation of the references to colour in this figure legend the reader is referred to the web version of this article.)

We use a criterion based on hypervolume [40] for selecting edges of the MST to cut. By cutting some edges, each subtree corresponds to a cluster of CCs. The sum of hypervolumes of the clusters is computed for evaluating the partition:

$$F_v = \sum_{i=1}^{k} [\det(C_i)]^{1/2},$$

where $\det(C_i)$ is the determinant of the covariance matrix $C_i$ of cluster $i$, computed from the constituent black pixels of the CCs in the cluster.
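The hypervolume sum can be illustrated with the sketch below, which treats each cluster as a list of (x, y) pixel coordinates. The function names are hypothetical, and the covariance is normalized by the number of pixels; the choice between that and the n−1 normalization is an assumption, since the text does not state it.

```python
from math import sqrt

def covariance_2d(points):
    """Covariance entries (cxx, cxy, cyy) of (x, y) points, normalized by n."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    return cxx, cxy, cyy

def hypervolume(points):
    """Square root of the determinant of the cluster covariance matrix."""
    cxx, cxy, cyy = covariance_2d(points)
    return sqrt(max(cxx * cyy - cxy * cxy, 0.0))  # guard against round-off

def total_hypervolume(clusters):
    """F_v: sum of hypervolumes over all clusters of a partition."""
    return sum(hypervolume(c) for c in clusters)

# Two thin horizontal strips of pixels standing in for two text lines.
line1 = [(x, y) for x in range(20) for y in (0, 1)]
line2 = [(x, y) for x in range(20) for y in (10, 11)]
fv_merged = total_hypervolume([line1 + line2])
fv_split = total_hypervolume([line1, line2])
```

Treating the two strips as one cluster gives a much larger value than treating them as two, which is the effect the edge-selection criterion exploits.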

Initially, all the components in the MST are considered as a single cluster, and every edge is deleted tentatively to split the cluster into two clusters (subtrees). The edge with the maximal reduction of the $F_v$ measure is selected for deletion such that the total $F_v$ measure of the document is minimized. We call this the maximum hypervolume reduction criterion, denoted by:

$$\mathrm{edge}_{\mathrm{deleted}} = \arg\max_{\mathrm{edge}} \Delta F_v = \arg\max_{\mathrm{edge}} \, [F_v(S_k) - F_v(S_{k+1})],$$

where $S_k = \{T_1, T_2, \ldots, T_k\}$ denotes the partition into $k$ disjoint subtrees ($S_1$ denotes the initial MST).
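One step of the maximum hypervolume reduction criterion can be sketched as follows: every remaining edge is removed tentatively, the total hypervolume of the resulting forest is recomputed, and the edge with the largest reduction is returned. This is a naive illustration rather than the authors' implementation; `subtrees`, `total_fv` and `best_edge_to_cut` are hypothetical names, and the covariance is normalized by the number of pixels as an assumption.

```python
from math import sqrt

def hypervolume(points):
    """sqrt(det(C)) of the 2x2 covariance (normalized by n) of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    return sqrt(max(cxx * cyy - cxy * cxy, 0.0))

def subtrees(nodes, edges):
    """Connected components (as node lists) of the forest (nodes, edges)."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            a = stack.pop()
            comp.append(a)
            for b in adj[a]:
                if b not in seen:
                    seen.add(b)
                    stack.append(b)
        comps.append(comp)
    return comps

def total_fv(nodes, edges, pixels):
    """F_v of the partition induced by the forest; pixels maps node -> points."""
    return sum(hypervolume([p for n in comp for p in pixels[n]])
               for comp in subtrees(nodes, edges))

def best_edge_to_cut(nodes, edges, pixels):
    """Return (edge, reduction) for the edge whose removal reduces F_v most."""
    base = total_fv(nodes, edges, pixels)
    best, best_gain = None, float("-inf")
    for e in edges:
        remaining = [f for f in edges if f != e]
        gain = base - total_fv(nodes, remaining, pixels)
        if gain > best_gain:
            best, best_gain = e, gain
    return best, best_gain

def blob(cx, cy):
    # A 2x2 block of pixels standing in for one connected component.
    return [(cx + dx, cy + dy) for dx in (0, 1) for dy in (0, 1)]

# Components 0 and 1 form one line, components 2 and 3 another; edge (1, 2)
# is the only between-line edge of this toy tree.
pixels = {0: blob(0, 0), 1: blob(10, 0), 2: blob(10, 10), 3: blob(0, 10)}
edges = [(0, 1), (1, 2), (2, 3)]
cut, gain = best_edge_to_cut(list(pixels), edges, pixels)
```

Recomputing the covariance from scratch for every tentative cut is quadratic in the number of edges; it keeps the sketch short, and an efficient version would update the cluster statistics incrementally.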

The $F_v$ measure cannot evaluate the number of clusters since it always decreases as the number of clusters increases. Fortunately, it is reasonable to assume rectangular shapes for the text lines (if a text line is curvilinear, it can be divided into several sublines that are approximately straight). We conjecture that when the number of clusters (partitioned text lines) is appropriate, a measure of straightness of the text lines reaches a maximum. We compute the total straightness measure as:

$$F_s = \sum_{i=1}^{k} \left(\frac{\lambda_{i1}}{\lambda_{i2}}\right)^{2},$$

where $k$ is the number of clusters (partitioned text lines), and $\lambda_{i1}$ and $\lambda_{i2}$ ($\lambda_{i1} \geq \lambda_{i2}$) are the eigenvalues of the covariance matrix of each cluster.
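The straightness measure can be sketched with the closed-form eigenvalues of a 2×2 covariance matrix. The function names are hypothetical, the covariance is normalized by the number of pixels, and the small `eps` floor on the smaller eigenvalue is an assumption added here to avoid division by zero for perfectly collinear pixels; the text does not describe such a guard.

```python
from math import sqrt

def covariance_2d(points):
    """Covariance entries (cxx, cxy, cyy) of (x, y) points, normalized by n."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    return cxx, cxy, cyy

def eigenvalues_2d(cxx, cxy, cyy):
    """Eigenvalues (larger, smaller) of the symmetric matrix [[cxx, cxy], [cxy, cyy]]."""
    half_trace = (cxx + cyy) / 2
    root = sqrt(max(half_trace ** 2 - (cxx * cyy - cxy * cxy), 0.0))
    return half_trace + root, half_trace - root

def straightness(clusters, eps=1e-9):
    """F_s: sum over clusters of (lambda_1 / lambda_2) squared."""
    total = 0.0
    for points in clusters:
        lam1, lam2 = eigenvalues_2d(*covariance_2d(points))
        total += (lam1 / max(lam2, eps)) ** 2  # eps floor is an assumption
    return total

# Two thin horizontal strips of pixels standing in for two text lines.
line1 = [(x, y) for x in range(20) for y in (0, 1)]
line2 = [(x, y) for x in range(20) for y in (10, 11)]
fs_split = straightness([line1, line2])
fs_merged = straightness([line1 + line2])
```

An elongated cluster has a large eigenvalue ratio, so the partition into two strips scores far higher than the merged, nearly square cluster.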

Our experiments demonstrate that the $F_s$ measure performs well in finding the number of clusters: the number of clusters at which $F_s$ reaches its maximum agrees well with the actual number of text lines. An example is shown in Fig. 9. By iteratively deleting edges according to $F_v$, the total $F_v$ measure and $F_s$ measure with increasing number of clusters are shown in Fig. 9(a) and (b), respectively. We can see that $k = 5$, corresponding to the maximum of $F_s$, gives a preferable partition of text lines (Fig. 9(c)).

Fig. 9. Partitioning criteria for the document in Fig. 3. (a) $F_v$ as a function of number of clusters; (b) $F_s$ as a function of number of clusters; (c) the partitioned text lines.

5.2. Post-processing

After initial grouping, most of the text lines have been grouped correctly, but a few errors may still exist. For example, some lines are split into several pieces because of large within-line horizontal gaps, or some CCs are falsely grouped into other lines. Most of these errors can be corrected using some heuristic rules similar to [27]. The post-processing procedure has the following steps:

(1) Estimate the orientation of the initial text lines using the least mean squared-error method, and estimate their height and width using the method in [30].

(2) If the length of a text line is shorter than 1/10 of the image width or it contains fewer than three CCs and the height is larger than half of the average height of all lines, it is labeled as “isolated line”. The other text lines are labeled as “unprocessed line”.

(3) If the height of an “unprocessed line” is smaller than half of the average height of all lines, all of its CCs are labeled as “isolated CC”. In an “unprocessed line”, if the distance between the centroid of a CC and the midline (which crosses the centroid of the text line and has the same orientation) is larger than half of the height of the text line, the CC is also labeled as “isolated CC”.

(4) Select the longest “unprocessed line” and merge it with another “unprocessed line” if all the following conditions are met: (a) the difference of their orientations is less than 15°; (b) the horizontal gap between their bounding boxes is less than 1/10 of the image width; (c) their bounding boxes overlap by more than 50% of the average height in the direction orthogonal to the average orientation. Mark the merged line as “processed line”.

(5) Iterate Step 4 until there is no “unprocessed line”.

(6) Merge an “isolated CC” to the i-th text line if the distance between the centroid of the CC and the midline of the text line is smaller than the height of the text line. If we cannot find a text line to merge the “isolated CC” into, the CC is labeled as “noise” and is deleted.

(7) Similar to Step 6, merge the CCs of an “isolated line” to a “processed line”; if we cannot find a text line to merge a CC into, we keep it in the “isolated line”, and if this “isolated line” still has CCs after merging all CCs, we label it as “processed line”.
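Step 6 of the procedure above can be sketched as a point-to-line distance test. The dictionary layout of a text line (`centroid`, `angle_deg`, `height`) and the function names are hypothetical, and picking the nearest line when several satisfy the condition is an assumption of this sketch; the text only states the distance condition.

```python
from math import cos, sin, radians

def distance_to_midline(point, line_centroid, angle_deg):
    """Perpendicular distance from point to the line through line_centroid
    whose orientation is angle_deg (measured from the horizontal axis)."""
    dx = point[0] - line_centroid[0]
    dy = point[1] - line_centroid[1]
    t = radians(angle_deg)
    return abs(dy * cos(t) - dx * sin(t))

def assign_isolated_cc(cc_centroid, lines):
    """Return the index of the text line that the isolated CC is merged into,
    or None if no line qualifies (the CC would then be labeled as noise).

    lines: list of dicts with keys 'centroid', 'angle_deg' and 'height'.
    """
    best, best_d = None, float("inf")
    for idx, line in enumerate(lines):
        d = distance_to_midline(cc_centroid, line["centroid"], line["angle_deg"])
        # Condition from Step 6: distance smaller than the line height.
        if d < line["height"] and d < best_d:
            best, best_d = idx, d
    return best

lines = [
    {"centroid": (500, 100), "angle_deg": 0.0, "height": 40},
    {"centroid": (500, 200), "angle_deg": 0.0, "height": 40},
]
near_first = assign_isolated_cc((50, 110), lines)
between_lines = assign_isolated_cc((50, 150), lines)
```

Here the component 10 pixels from the first midline is merged into the first line, while the component halfway between the two lines is 50 pixels from both midlines and is therefore left unassigned.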

6. Experimental results

We evaluated the performance of our algorithm on a large database of unconstrained handwritten Chinese documents and compared it with some existing reference algorithms. Below, we briefly describe the database and evaluation methodology, outline the reference algorithms, and then present the experimental results.

6.1. Database

A large database of unconstrained Chinese handwritten documents, HIT-MW [41], was collected by Harbin Institute of Technology and is publicly available for free use. The database contains 853 text forms written by more than 780 writers. There are 8,677 text lines in total and each line has 21.51 characters on average. Each document was scanned at a resolution of 300 DPI. A typical image size is approximately 1700×1500 pixels, and each image contains 530 CCs on average. Fig. 10 shows two images in this database.

Fig. 10. Example documents in the HIT-MW database.

Since the images in the HIT-MW database are not labeled at the CC level (only some of the images have been segmented into text lines), we have annotated all the 853 document images using our ground-truthing tool GTLC [42].

6.2. Evaluation methodology

Several evaluation schemes have been proposed for document image segmentation [43–45], but they were designed for printed documents or graphics and evaluate the performance based on bounding boxes. It is not appropriate to measure handwritten text lines using bounding boxes because they are often curved and multi-skewed. Therefore, we evaluate the performance by counting the number of matches between the pixels of detected text lines and the pixels in the ground-truth data. Similar to [27], we calculate the MatchScore matrix between a detected text line and a ground-truthed line:

$$\mathrm{MatchScore}(i, j) = \frac{T(G_i \cap R_j)}{T(G_i \cup R_j)},$$

where $G_i$ is the set of pixels of the i-th ground-truthed text line, $R_j$ is the set of pixels of the j-th detected text line, and $T(S)$ is the cardinality of set $S$. The Hungarian algorithm is used to find a one-to-one correspondence between the detected text lines and the ground-truthed ones [46]. Since the number of lines in the two sets may be different, either a detected line or a ground-truthed line is allowed to be matched with a dummy line. The performance is evaluated at the text line level. If a ground-truthed line and the corresponding detected line share at least 95% of pixels, the detected text line is claimed to be correct. The percentage of correctly detected text lines out of the ground-truthed lines gives the correct detection rate (recall rate), and the percentage of false lines out of the detected lines gives the error rate.
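The evaluation protocol can be sketched on toy data as below. Several points are assumptions of this sketch: a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimal one-to-one matching but is only practical for a handful of lines), empty sets play the role of dummy lines, "sharing at least 95% of pixels" is read as MatchScore ≥ 0.95, and every detected line that is not counted as correct is counted as false. The function names are hypothetical.

```python
from itertools import permutations

def match_score(g, r):
    """Intersection over union of two pixel sets."""
    union = len(g | r)
    return len(g & r) / union if union else 0.0

def best_assignment(truth, detected):
    """Optimal one-to-one matching by exhaustive search (stand-in for the
    Hungarian algorithm); the shorter list is padded with empty dummy lines."""
    n = max(len(truth), len(detected))
    g = truth + [set()] * (n - len(truth))
    r = detected + [set()] * (n - len(detected))
    score = [[match_score(gi, rj) for rj in r] for gi in g]
    best = max(permutations(range(n)),
               key=lambda p: sum(score[i][p[i]] for i in range(n)))
    return [(i, best[i], score[i][best[i]]) for i in range(n)]

def detection_rates(truth, detected, threshold=0.95):
    """Return (correct detection rate, error rate) at the text line level."""
    pairs = best_assignment(truth, detected)
    correct = sum(1 for i, j, s in pairs
                  if i < len(truth) and j < len(detected) and s >= threshold)
    recall = correct / len(truth)
    error = (len(detected) - correct) / len(detected)
    return recall, error

# Two ground-truth lines; three detected lines, one of them spurious.
g0 = {(x, 0) for x in range(100)}
g1 = {(x, 10) for x in range(100)}
r0 = set(g1)                          # exact match of the second line
r1 = {(x, 0) for x in range(2, 100)}  # first line with two pixels missing
r2 = {(x, 20) for x in range(5)}      # spurious detection
recall, error = detection_rates([g0, g1], [r0, r1, r2])
```

In this example both ground-truth lines are matched with scores of 0.98 and 1.0, and the spurious detection is paired with a dummy line.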

6.3. Reference algorithms

In addition to the comparison with our previous clustering algorithm with hand-crafted metric [28], we compared the hypervolume reduction criterion in text line grouping with other criteria in [37,39]. Then, we compared the performance of the proposed algorithm not only with two algorithms, X–Y cut [1] and Docstrum [12], that were designed for printed documents, but also with two algorithms, stroke skew correction [6] and piece-wise projection [5], that were designed for segmenting handwritten text lines.

The hand-crafted metric was formed using a subset of the features described in Section 4.2: it is the weighted combination of the horizontal minimum runlength and the Euclidean distance between the centroids of two CCs, with the weight determined by the normalized vertical overlapping degree. This empirical combination was found to perform fairly well. For a fair comparison with other methods, we also optimize the weighting parameter on some ground-truthed document images.

After MST clustering, the algorithms in [37,39] cut edges in different ways. The one in [37] finds a global threshold of edge length according to the edge length histogram of the linkage graph; then all the edges with length over the threshold are cut. The authors of [37] demonstrated that this method was more efficient than the algorithm in [36]. The algorithm in [39] measures each hypothesized cluster (subtree) using the standard deviation of edge lengths within the subtree. The edge selected to cut is the one that reduces the average standard deviation maximally. This is similar to our method of hypervolume reduction, but it uses the deviation of edge lengths instead of the hypervolume.
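To make the global-threshold style of edge cutting concrete, the following Python sketch cuts every tree edge longer than a given threshold and returns the resulting clusters. It is only an illustration: the threshold is passed in directly rather than derived from the edge length histogram as in [37], and the edge list, node numbering, and function name are hypothetical.

```python
def cut_by_threshold(n_nodes, edges, threshold):
    """Cluster tree nodes by removing all edges longer than `threshold`.

    `edges` is a list of (u, v, length) tuples describing a spanning tree.
    Returns the clusters as a sorted list of sorted node lists.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, length in edges:
        if length <= threshold:  # edges over the threshold are cut
            parent[find(u)] = find(v)

    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical tree: nodes 0-2 belong to one text line, nodes 3-5 to another.
# The between-line edge (2, 3) is shorter than the within-line edge (1, 2).
tree_edges = [(0, 1, 10), (1, 2, 30), (2, 3, 25), (3, 4, 12), (4, 5, 11)]
clusters = cut_by_threshold(6, tree_edges, threshold=28)
```

With a threshold of 28 the first line is split at edge (1, 2) while node 2 stays attached to the second line, which illustrates why a single global edge length threshold can fail when between-line edges are shorter than some within-line edges.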

As a typical top-down method, the X–Y cut algorithm [1] builds a structural tree of the document by recursively analyzing the horizontal and vertical projection profiles of partitioned regions. The Docstrum algorithm [12] builds the document structure bottom-up by merging neighboring CCs. We used in our experiments a public domain implementation of the X–Y cut and Docstrum algorithms [45]. The stroke skew correction algorithm [6] estimates the skew angle of text lines from the horizontal stroke histogram and then segments text lines using projection profiles after deskewing. The piece-wise projection algorithm [5] obtains an initial set of text lines from the piece-wise projection profiles; then any obstructing CCs are associated with a line above or below by a probability evaluated under a Gaussian assumption or a distance metric. This algorithm was shown to perform very well in segmenting English and Arabic text lines.

To yield the best performance for the above algorithms (MST clustering with hand-crafted metric, X–Y cut, Docstrum, piece-wise projection), we use the Nelder–Mead simplex search method [47] to optimize their free parameters based on ground-truthed data, as Mao et al. did in [45,48]. The hand-crafted metric has a weighting parameter to optimize. The X–Y cut algorithm and the Docstrum algorithm each have four parameters optimized by simplex search, as done in [45]. For the piece-wise projection algorithm, the authors of [5] did not mention any free parameter. But in our implementation, we found that two parameters, the minimal difference and the minimal distance between the neighboring peak and valley of the projection histogram, need to be determined.

Table 1. Correct rates of text line detection using learned and hand-crafted metrics.

Metric                                     Correct detection
Learned metric                             1051 (95.02%)
Hand-crafted metric (optimized weight)     1008 (91.14%)
Hand-crafted metric (empirical weight)     975 (88.16%)

Table 2. Correct rates of text line detection using different clustering criteria.

Clustering criterion       Correct detection
Hypervolume reduction      1051 (95.02%)
Criterion in [37]          823 (74.41%)
Criterion in [39]          341 (30.83%)

The other strip-based projection methods were not evaluated in our experiments because they rely on many heuristic rules and free parameters, and so it is hard to tune the rules and parameters to optimize the performance.
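As an illustration of the simplex search used for parameter tuning in this section, the following is a compact pure-Python sketch of the Nelder–Mead method [47] with the usual reflection, expansion, contraction and shrink steps. It is a sketch under stated assumptions, not the code used in the experiments: the objective `toy_error` is a hypothetical smooth stand-in for the segmentation error measured on ground-truthed images, and its two parameters and their optimum are invented for the example.

```python
def nelder_mead(f, x0, step=0.5, tol=1e-12, max_iter=1000):
    """Minimize f starting from x0 with a basic Nelder-Mead simplex search."""
    n = len(x0)
    # Initial simplex: x0 plus one vertex displaced along each axis.
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)

    for _ in range(max_iter):
        simplex.sort(key=f)
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        f_best, f_second, f_worst = f(best), f(second_worst), f(worst)
        if f_worst - f_best < tol:
            break
        # Centroid of all vertices except the worst one.
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]

        def along(t):
            # Point on the line through the centroid and the worst vertex.
            return [centroid[i] + t * (worst[i] - centroid[i]) for i in range(n)]

        reflected = along(-1.0)
        f_ref = f(reflected)
        if f_ref < f_best:
            expanded = along(-2.0)
            simplex[-1] = expanded if f(expanded) < f_ref else reflected
        elif f_ref < f_second:
            simplex[-1] = reflected
        else:
            if f_ref < f_worst:
                contracted = along(-0.5)  # outside contraction
                accept = f(contracted) <= f_ref
            else:
                contracted = along(0.5)   # inside contraction
                accept = f(contracted) < f_worst
            if accept:
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway toward the best one.
                simplex = [best] + [
                    [best[i] + 0.5 * (v[i] - best[i]) for i in range(n)]
                    for v in simplex[1:]
                ]
    return min(simplex, key=f)


def toy_error(params):
    # Hypothetical objective with its minimum at weight = 0.3, gap = 2.0.
    weight, gap = params
    return (weight - 0.3) ** 2 + (gap - 2.0) ** 2


best_params = nelder_mead(toy_error, [0.0, 0.0])
```

The method needs only function values, not gradients, which suits objectives such as a detection error rate that are evaluated by running a segmentation algorithm on training images.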

6.4. Performance comparison

We conducted three experiments: one to compare the performance of metric learning with that of the hand-crafted metric, one to compare hypervolume reduction in text line grouping with other MST clustering criteria, and one to compare the proposed text line segmentation algorithm with four existing methods.

6.4.1. Comparing metric learning with hand-crafted metric

The example of Fig. 5 in Section 4 demonstrates that distance metric learning can clearly improve the performance of MST clustering of handwritten documents. To evaluate the performance quantitatively, we selected 150 images with complex layout from the HIT-MW database: 50 documents were used for distance metric learning and for optimizing the weighting parameter of the hand-crafted metric, and the remaining 100 documents, containing 1,106 text lines, were used for evaluation. Previously, the weight of the hand-crafted metric was determined from the normalized vertical overlapping degree [28]. In this experiment, all the processing steps except the distance metric are the same.

The correct rates of text line detection by MST clustering with the learned metric and the hand-crafted metric (with optimized weight and empirical weight) are shown in Table 1. We can see that distance metric learning improves the performance of text line segmentation significantly. By optimizing the weighting parameter of the hand-crafted metric, the performance is also improved considerably compared to the one with empirical weight.

6.4.2. Comparing MST clustering criteria

To compare the text line segmentation performance of the proposed hypervolume reduction criterion with the other MST clustering criteria in [37,39], we used the same 150 images as in Section 6.4.1: 50 for metric learning and 100 for evaluation. For the three criteria compared, all the processing steps except the tree partitioning procedure are the same. The correct rates of text line detection on the 100 test images are shown in Table 2. We can see that the hypervolume reduction criterion yields the best performance. Since the criterion in [37] finds a global threshold of edge length, between-line edges shorter than the threshold cannot be deleted to separate the linked text lines. With the criterion in [39], we observed that the local minimum of the standard deviation reduction function always gives fewer clusters than the real number of text lines. This causes many text lines to be merged with each other. In contrast, our algorithm based on the hypervolume reduction criterion combined with the straightness measure of text lines mostly finds the correct number of real clusters.

Table 3. Correct rates and error rates of text line detection on 803 images.

Algorithm                          Detected lines   Correct detection   Error rate (%)
Proposed (with post-processing)    8211             8008 (98.02%)       2.47
Proposed (w/o post-processing)     8444             7822 (95.75%)       7.37
X–Y cut [1]                        8605             3682 (45.07%)       57.21
Docstrum [12]                      9602             5341 (65.38%)       44.38
Stroke skew correction [6]         5897             4521 (55.34%)       23.33
Piece-wise projection [5]          8150             7521 (92.07%)       7.72

6.4.3. Comparison with existing methods

To compare the performance of the proposed MST clustering-based text line segmentation algorithm with the X–Y cut, Docstrum, stroke skew correction and piece-wise projection algorithms, we randomly selected 50 images (containing 508 text lines) in the HIT-MW database for training (distance metric learning and parameter tuning), and used the remaining 803 images (containing 8,169 text lines) for evaluation. The correct rates (recall rates, percentage of correctly detected lines out of all ground-truthed ones) and error rates (percentage of error text lines out of all detected ones) are shown in Table 3. To assess the effect of post-processing in the proposed clustering-based method, we evaluated the algorithm both with and without post-processing. From Table 3, we can see that post-processing is effective in improving the correct rate of text line segmentation. However, even without post-processing, the proposed method still yields a higher correct rate and a lower error rate than the four existing algorithms. Among the existing methods, the piece-wise projection algorithm yields the best performance. We observe that the X–Y cut and Docstrum algorithms tend to extract more text lines, but many of them do not match the ground-truthed lines. The stroke skew correction algorithm is designed for segmenting handwritten Chinese text lines, but it outperforms only the X–Y cut algorithm. The piece-wise projection algorithm achieves competitive results because it can tolerate the multi-skew and curvature of the text lines to a certain extent.

Fig. 11 shows the segmentation results of a document image using the X–Y cut, Docstrum, stroke skew correction, piece-wise projection and the proposed algorithm. From the figure, we can see that only the proposed algorithm segments the text lines entirely correctly, while the X–Y cut, Docstrum and stroke skew correction algorithms generate many false text lines. Though the piece-wise projection algorithm finds almost all text lines, some small CCs are falsely segmented, such as in the last text line in Fig. 11(d). Overall, the proposed algorithm performs very well on unconstrained handwritten Chinese documents with multi-skewed and curved text lines.

The proposed cluster-based algorithm was implemented in C++. The overall processing time for an image of 1700×1500 pixels containing 1000 CCs (about 500–600 characters) is about 2.5 seconds on a personal computer with a Pentium 4 3.6 GHz CPU and 1 GB of memory. This speed is acceptable.

We could not compare our method with the level set based method of Li et al. [27] because that method is non-trivial to implement and their image database is not available for our evaluation. Our algorithm and that of Li et al. were each compared with the X–Y cut and Docstrum algorithms, and both yielded significantly higher correct detection rates than the X–Y cut and Docstrum algorithms. Nevertheless, the level set based method turns out to be more computationally demanding: according to [27], the segmentation of an image of 2000×1500 pixels takes about 20 seconds on a 1.6 GHz CPU with 1 GB of memory.

Fig. 11. Segmentation results of a document image by X–Y cut (a), Docstrum (b), stroke skew correction (c), piece-wise projection (d) and the proposed algorithm (e).

6.5. Error analysis

The proposed clustering-based method with metric learning, although it performs well, still makes some text line detection errors. The errors are mostly of two types: (1) error line splitting (ELS): a real text line is split into two or more lines (corresponding to multiple clusters); (2) error line merging (ELM): two or more real text lines are merged into a single cluster.

We observed that ELS occurs when characters are inserted in a line, such as those (marked by blue circles) in Fig. 12.

ELM is mainly caused by the overlapping of two neighboring text lines, especially the touching of characters, such as that (marked by a blue circle) in Fig. 13. In this case, since the two text lines are connected by only a few touching characters, a post-processing procedure is necessary to separate them vertically.


Fig. 12. An example of error line splitting.

Fig. 13. An example of error line merging.

7. Conclusion

We propose a new method for text line segmentation in unconstrained handwritten Chinese document images based on minimum spanning tree (MST) clustering with distance metric learning. This bottom-up method is able to segment multi-skewed, curved and slightly overlapping text lines. Except for some empirical parameters (which are easy to determine and do not influence the performance critically) in the pre-processing of connected components (CCs) and the post-processing of text lines, this algorithm has no artificial parameter in clustering. In MST clustering, the metric of distance between CCs is learned on a dataset of pairs of components labeled as “within-line” or “between-line”. This avoids artificial tuning of the metric and improves the clustering performance significantly. The number of clusters is automatically determined by cutting the edges of the generated tree using a hypervolume reduction criterion and a straightness measure. The proposed algorithm was evaluated on a large database of unconstrained handwritten Chinese documents, and was demonstrated to be superior to some previous algorithms. Our algorithm can be further improved by refining the features of the distance metric and the post-processing procedure, and remains to be evaluated on document images of various languages by customizing the between-component features and training with document images of specific languages.

Acknowledgments

The authors would like to thank Tonghua Su for authorizing us to use the HIT-MW database, Zhenglong Li for discussions on distance metric learning, and Gang Liu and Yi Li for their suggestions on the experiments. This research was supported by the National Natural Science Foundation of China (NSFC) under grant nos. 60775004 and 60825301.

References

[1] G. Nagy, S. Seth, M. Viswanathan, A prototype document image analysis system for technical journals, Computer 25 (7) (1992) 10–22.

[2] J. He, A.C. Downton, User-assisted archive document analysis for digital library construction, in: Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol. 1, 2003, pp. 498–502.

[3] A. Zahour, B. Taconet, P. Mercy, S. Ramdane, Arabic handwritten text-line extraction, in: Proceedings of the Sixth International Conference on Document Analysis and Recognition, 2001, pp. 281–285.

[4] U. Pal, S. Datta, Segmentation of Bangla unconstrained handwritten text, in: Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol. 2, 2003, pp. 1128–1132.

[5] M. Arivazhagan, H. Srinivasan, S. Srihari, A statistical approach to line segmentation in handwritten documents, in: Document Recognition and Retrieval XIV, Proceedings of the SPIE, 2007, pp. 6500T-1-11.

[6] T. Su, T. Zhang, H. Huang, Y. Zhou, Skew detection for Chinese handwriting by horizontal stroke histogram, in: Proceedings of the Ninth International Conference on Document Analysis and Recognition, 2007, pp. 899–903.

[7] C. Weliwitage, A.L. Harvey, A.B. Jennings, Handwritten document offline text line segmentation, in: Proceedings of Digital Image Computing: Techniques and Applications, 2005, pp. 184–187.

[8] M. Liwicki, E. Indermuehle, H. Bunke, On-line handwritten text line detection using dynamic programming, in: Proceedings of the Ninth International Conference on Document Analysis and Recognition, 2007, pp. 447–451.

[9] S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri, D.K. Basu, Text line extraction from multi-skewed handwritten documents, Pattern Recognition 40 (6) (2007) 1825–1839.

[10] U. Pal, P.P. Roy, Multioriented and curved text lines extraction from Indian documents, IEEE Transactions on Systems, Man and Cybernetics, Part B 34 (4) (2004) 1676–1684.

[11] U. Pal, P.P. Roy, Text line extraction from India document, in: Proceedings of the Fifth International Conference on Advances in Pattern Recognition, 2003, pp. 275–279.

[12] L. O'Gorman, The document spectrum for page layout analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (11) (1993) 1162–1173.

[13] F. Kimura, Y. Miyake, M. Shridhar, Handwritten ZIP code recognition using lexicon free word recognition algorithm, in: Proceedings of the Third International Conference on Document Analysis and Recognition, 1995, pp. 906–910.

[14] K. Kise, A. Sato, M. Iwata, Segmentation of page images using the area Voronoi diagram, Computer Vision and Image Understanding 70 (3) (1998) 370–382.

[15] L. Likforman-Sulem, C. Faure, Extracting lines on handwritten document by perceptual grouping, in: Advances in Handwriting and Drawing: A Multidisciplinary Approach, 1994, pp. 21–38.

[16] S. Nicolas, T. Paquet, L. Heutte, Text line segmentation in handwritten document using a production system, in: Proceedings of the Ninth International Workshop on Frontiers in Handwriting Recognition, 2004, pp. 245–250.

[17] I.S.I. Abuhaiba, S. Datta, M.J.J. Holt, Line extraction and stroke ordering of text pages, in: Proceedings of the Third International Conference on Document Analysis and Recognition, vol. 1, 1995, pp. 390–393.

[18] A. Simon, J.-C. Pret, A.P. Johnson, A fast algorithm for bottom-up document layout analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (3) (1997) 273–277.

[19] B. Yu, A. Jain, A robust and fast skew detection algorithm for generic document, Pattern Recognition 29 (10) (1996) 1599–1629.

[20] Y. Pu, Z. Shi, A natural learning algorithm based on Hough transform for text lines extraction in handwritten document, in: Proceedings of the Sixth International Workshop on Frontiers in Handwriting Recognition, 1998, pp. 637–646.

[21] L. Likforman-Sulem, A. Hanimyan, C. Faure, A Hough based algorithm for extracting text lines in handwritten documents, in: Proceedings of the Third International Conference on Document Analysis and Recognition, 1995, pp. 774–777.

[22] G. Louloudis, B. Gatos, I. Pratikakis, K. Halatis, A block-based Hough transform mapping for text line detection in handwritten document, in: Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, 2006, pp. 515–520.

[23] Z. Shi, S. Setlur, V. Govindaraju, Text extraction from gray scale historical document image using adaptive local connectivity map, in: Proceedings of the Eighth International Conference on Document Analysis and Recognition, vol. 2, 2005, pp. 794–798.


[24] Z.X. Shi, V. Govindaraju, Line separation for complex document images using fuzzy runlength, in: Proceedings of the First International Conference on Document Image Analysis for Libraries, 2004, pp. 306–312.

[25] D.J. Kennard, W.A. Barrett, Separating lines of text in free-form handwritten historical documents, in: Proceedings of the Second International Conference on Document Image Analysis for Libraries, 2006, pp. 12–23.

[26] S. Nicolas, T. Paquet, L. Heutte, Markov random field models to extract the layout of complex handwritten documents, in: Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, 2006, pp. 563–568.

[27] Y. Li, Y. Zheng, D. Doermann, S. Jaeger, Script-independent text line segmentation in freestyle handwritten document, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (8) (2008) 1–17.

[28] F. Yin, C.-L. Liu, Handwritten text line extraction based on minimal spanning tree clustering, in: Proceedings of the Fifth International Conference on Wavelet Analysis and Pattern Recognition, vol. 3, 2007, pp. 1123–1128.

[29] F. Chang, C.J. Chen, C.J. Lu, A linear-time component-labeling algorithm using contour tracing technique, Computer Vision and Image Understanding 93 (2) (2004) 206–220.

[30] C.-L. Liu, M. Koga, H. Fujisawa, Lexicon-driven segmentation and recognition of handwritten character strings for Japanese address reading, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (11) (2002) 1425–1437.

[31] A.V. Aho, J.E. Hopcroft, J.D. Ullman, Data Structures and Algorithms, Addison-Wesley, 1983.

[32] A. Schenker, M. Last, H. Bunke, A. Kandel, Fuzzy clustering with genetically adaptive scaling, International Journal of Image and Graphics 2 (4) (2002) 557–572.

[33] C. Domeniconi, Locally adaptive techniques for pattern classification, Ph.D. Thesis, University of California, Riverside, 2002.

[34] E. Xing, A.Y. Ng, M. Jordan, S. Russell, Distance metric learning with application to clustering with side-information, in: Advances in Neural Information Processing Systems, vol. 15, 2003, pp. 505–512.

[35] L. Vandenberghe, S. Boyd, Semidefinite programming, SIAM Review 38 (1) (1996) 1–50.

[36] C.T. Zahn, Graph-theoretical methods for detecting and describing Gestalt clusters, IEEE Transactions on Computers 20 (1) (1971) 68–86.

[37] Y. He, L.H. Chen, A threshold criterion, auto-detection and its use in MST-based clustering, Intelligent Data Analysis 9 (3) (2005) 253–271.

[38] L. Yujian, A clustering algorithm based on maximal θ-distant subtrees, Pattern Recognition 40 (5) (2007) 1425–1431.

[39] O. Grygorash, Y. Zhou, Z. Jorgensen, Minimum spanning tree based clustering algorithm, in: Proceedings of the 18th International Conference on Tools with Artificial Intelligence, 2006, pp. 73–81.

[40] I. Gath, A.B. Geva, Unsupervised optimal fuzzy clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (7) (1989) 773–781.

[41] T. Su, T. Zhang, D. Guan, Corpus-based HIT-MW database for offline recognition of general-purpose Chinese handwritten text, International Journal on Document Analysis and Recognition 10 (1) (2007) 27–38.

[42] F. Yin, C.-L. Liu, A tool for ground-truthing text lines and characters in handwritten Chinese documents, submitted to ICDAR2009.

[43] C. Wolf, J.M. Jolion, Object count/area graphs for the evaluation of object detection and segmentation algorithms, International Journal on Document Analysis and Recognition 8 (4) (2006) 280–296.

[44] I.T. Phillips, A.K. Chhabra, Empirical performance evaluation of graphics recognition system, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (9) (1999) 849–870.

[45] S. Mao, T. Kanungo, Empirical performance evaluation methodology and its application to page segmentation algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3) (2001) 242–256.

[46] G. Liu, R.M. Haralick, Optimal matching problem in detection and recognition performance evaluation, Pattern Recognition 35 (10) (2002) 2125–2135.

[47] J. Nelder, R. Mead, A simplex method for function minimization, The Computer Journal 7 (4) (1965) 308–313.

[48] S. Mao, T. Kanungo, Automatic training of page segmentation algorithms: an optimization approach, in: Proceedings of the 15th International Conference on Pattern Recognition, 2000, pp. 531–534.

About the Author—FEI YIN received the B.S. degree in Computer Science from Xidian University, Xi'an, China, and the M.E. degree in Pattern Recognition and Intelligent Systems from Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively. He is currently pursuing a Ph.D. degree in Pattern Recognition and Intelligent Systems at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include document image analysis, handwritten character recognition and computer vision.

About the Author—CHENG-LIN LIU received the B.S. degree in Electronic Engineering from Wuhan University, Wuhan, China, the M.E. degree in Electronic Engineering from Beijing Polytechnic University, Beijing, China, and the Ph.D. degree in Pattern Recognition and Intelligent Systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. Since 2005, he has been a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China, and is now the Deputy Director of the laboratory. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis. He has published over 80 technical papers in international journals and conferences. He won the IAPR/ICDAR Young Investigator Award of 2005.
