
Drawing and Recognizing Chinese Characters with Recurrent Neural Network

Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, Yoshua Bengio

Abstract—Recent deep learning based approaches have achieved great success on handwriting recognition. Chinese characters are among the most widely adopted writing systems in the world. Previous research has mainly focused on recognizing handwritten Chinese characters. However, recognition is only one aspect of understanding a language; another challenging and interesting task is to teach a machine to automatically write (pictographic) Chinese characters. In this paper, we propose a framework that uses the recurrent neural network (RNN) as both a discriminative model for recognizing Chinese characters and a generative model for drawing (generating) Chinese characters. To recognize Chinese characters, previous methods usually adopt convolutional neural network (CNN) models, which require transforming the online handwriting trajectory into image-like representations. Instead, our RNN-based approach is an end-to-end system which directly deals with the sequential structure and does not require any domain-specific knowledge. With the RNN system (combining an LSTM and GRU), state-of-the-art performance can be achieved on the ICDAR-2013 competition database. Furthermore, under the RNN framework, a conditional generative model with character embedding is proposed for automatically drawing recognizable Chinese characters. The generated characters (in vector format) are human-readable and can also be recognized by the discriminative RNN model with high accuracy. Experimental results verify the effectiveness of using RNNs as both generative and discriminative models for the tasks of drawing and recognizing Chinese characters.

Index Terms—Recurrent neural network, LSTM, GRU, discriminative model, generative model, handwriting.

I. INTRODUCTION

Reading and writing are among the most important and fundamental skills of human beings. Automatic recognition (or reading) of handwritten characters has been studied for a long time [1] and has achieved great progress during the past decades [2], [3]. However, the automatic drawing (or writing) of characters has not been studied as much, until the recent advances based on recurrent neural networks for generating sequences [4]. In the development of human intelligence, the skills of reading and writing are mutually complementary. Therefore, for the purpose of machine intelligence, it would be interesting to handle them in a unified framework.

Chinese characters constitute the oldest continuously used system of writing in the world. Moreover, Chinese characters have been widely used (modified or extended) in many Asian countries, such as China, Japan, and Korea. There are tens of thousands of different Chinese characters.

Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang and Cheng-Lin Liu are with the NLPR at the Institute of Automation, Chinese Academy of Sciences, P.R. China. Email: {xyz, fyin, ymzhang, liucl}@nlpr.ia.ac.cn.

Yoshua Bengio is with the MILA lab at the University of Montreal, Canada. Email: [email protected].

Most of them can be well recognized by most people; however, nowadays, it is becoming more and more difficult for people to write them correctly, due to the overuse of keyboard or touch-screen based input methods. Compared with reading, writing of Chinese characters is gradually becoming a forgotten or missing skill.

For the task of automatic recognition of handwritten Chinese characters, there are two main categories of approaches: online and offline methods. With the success of deep learning [5], [6], the convolutional neural network (CNN) [7] has been widely applied to handwriting recognition. The strong prior knowledge of convolution makes the CNN a powerful tool for image classification. Since offline characters are naturally represented as scanned images, it is natural and works well to apply CNNs to the task of offline recognition [8], [9], [10], [11]. However, in order to apply CNNs to online characters, the online handwriting trajectory must first be transformed into some image-like representation, such as the AMAP [12], the path signature maps [13] or the directional feature maps [14].

During the data acquisition of online handwriting, the pen-tip movements (xy-coordinates) and pen states (down or up) are automatically stored as (variable-length) sequential data. Instead of transforming them into image-like representations, we choose to deal with the raw sequential data in order to exploit the richer information it carries. In this paper, different from the traditional approaches based on CNNs, we propose to use recurrent neural networks (RNNs) combined with bidirectional long short term memory (LSTM) [15], [16] and gated recurrent units (GRU) [17] for online handwritten Chinese character recognition. The RNN has been shown to be very effective for English handwriting recognition [18]. For Chinese character recognition, compared with the CNN-based approaches, our method is fully end-to-end and does not require any domain-specific knowledge. State-of-the-art performance has been achieved by our method on the ICDAR-2013 competition database [19]. To the best of our knowledge, this is the first work on using RNNs for end-to-end online handwritten Chinese character recognition.

Besides the recognition (reading) task, this paper also considers the automatic drawing of Chinese characters (the writing task). Under the recurrent neural network framework, a conditional generative model is used to model the distribution of Chinese handwriting, allowing the model to generate new handwritten characters by sampling from the probability distribution associated with the RNN. The study of generative models is an important and active research topic in the deep learning field [6]. Many useful generative models have been

arXiv:1606.06539v1 [cs.CV] 21 Jun 2016



Fig. 1. Illustration of three online handwritten Chinese characters. Each color represents a stroke and the numbers indicate the writing order. The purpose of this paper is to automatically recognize and draw (generate) real and cursive Chinese characters under a single framework based on recurrent neural networks.

proposed, such as NADE [20], the variational auto-encoder [21], DRAW [22], and so on. To better model the generating process, the generative adversarial network (GAN) [23] simultaneously trains a generator to capture the data distribution and a discriminator to distinguish real and generated samples in a min-max optimization framework. Under this framework, high-quality images can be generated with the LAPGAN [24] and DCGAN [25] models, which are extensions of the original GAN.

Recently, it was shown by [26] that realistic-looking Chinese characters can be generated with DCGAN. However, the generated characters are offline images which ignore the handwriting dynamics (temporal order and trajectory). To automatically generate the online (dynamic) handwriting trajectory, the recurrent neural network (RNN) with LSTM was shown to be very effective for English online handwriting generation [4]. The contribution of this paper is to study how to extend and adapt this technique to Chinese character generation, considering the differences between English and Chinese handwriting habits and the large number of categories of Chinese characters. As shown by [27], fake and regularly-written Chinese characters can be generated under the LSTM-RNN framework. However, a more interesting and challenging problem is the generation of real (readable) and cursive handwritten Chinese characters.

To reach this goal, we propose a conditional RNN-based generative model (equipped with GRUs or LSTMs) to automatically draw human-readable cursive Chinese characters. The character embedding is jointly trained with the generative model. Therefore, given a character class, different samples (belonging to the given class but with different writing styles) can be automatically generated by the RNN model conditioned on the embedding. In this paper, the tasks of automatically drawing and recognizing Chinese characters are both completed with RNNs, seen as either generative or discriminative models. Therefore, to verify the quality of the generated characters, we can feed them into the pre-trained discriminative RNN model to see whether they can be correctly classified or not. It is found that most of the generated characters can be automatically recognized with high accuracy. This verifies the effectiveness of the proposed method in generating real and cursive Chinese characters.

The rest of this paper is organized as follows. Section II introduces the representation of online handwritten Chinese characters. Section III describes the discriminative RNN model for end-to-end recognition of handwritten Chinese characters.

Section IV reports the experimental results on the ICDAR-2013 competition database. Section V details the generative RNN model for drawing recognizable Chinese characters. Section VI shows examples and analyses of the generated characters. Finally, Section VII draws concluding remarks.

II. REPRESENTATION FOR ONLINE HANDWRITTEN CHINESE CHARACTERS

Different from the static image based representation for offline handwritten characters, rich dynamic (spatial and temporal) information can be collected in the writing process for online handwritten characters, which can be represented as a variable-length sequence:

[[x1, y1, s1], [x2, y2, s2], . . . , [xn, yn, sn]], (1)

where xi and yi are the xy-coordinates of the pen movements and si indicates which stroke point i belongs to. As shown in Fig. 1, Chinese characters usually contain multiple strokes, and each stroke is produced by numerous points. Besides the character shape information, the writing order is also preserved in the online sequential data, which is valuable and very hard to recover from a static image. Therefore, to capture the dynamic information for increasing recognition accuracy and also to improve the naturalness of the generated characters, we directly make use of the raw sequential data rather than transforming them into an image-like representation.

A. Removing Redundant Points

Different people may have different handwriting habits (e.g., regular, fluent, cursive, and so on), resulting in significantly different numbers of sampling points, even when they are writing the same character. To remove the redundant points, we propose a simple strategy to preprocess the sequence. Consider a particular point (xi, yi, si). Assume si = si−1 = si+1; otherwise, point i is the starting or ending point of a stroke, which is always preserved. Whether to remove point i or not depends on two conditions. The first condition is based on the distance of this point from its former point:

$$\sqrt{(x_i - x_{i-1})^2 + (y_i - y_{i-1})^2} < T_{dist}. \quad (2)$$

As shown in Fig. 2(a), if point i is too close to point i − 1, it should be removed. Moreover, point i should also be removed if it lies on a straight line connecting points i − 1 and i + 1. Let


[Fig. 2 panels (a)-(d): the character in (c) has pointNum=96, strokeNum=8; after preprocessing, in (d), pointNum=44, strokeNum=8.]

Fig. 2. (a) Removing of redundant points. (b) Coordinate normalization. (c) Character before preprocessing. (d) Character after preprocessing.

$\Delta x_i = x_{i+1} - x_i$ and $\Delta y_i = y_{i+1} - y_i$; the second condition is based on the cosine similarity:

$$\frac{\Delta x_{i-1}\Delta x_i + \Delta y_{i-1}\Delta y_i}{(\Delta x_{i-1}^2 + \Delta y_{i-1}^2)^{0.5}\,(\Delta x_i^2 + \Delta y_i^2)^{0.5}} > T_{cos}. \quad (3)$$

If either of the conditions in Eq. (2) or Eq. (3) is satisfied, point i should be removed. With this preprocessing, the shape information of the character is still well preserved, but each item (point) in the new sequence becomes more informative, since the redundant points have been removed.
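The two removal conditions can be sketched as a small preprocessing routine. This is an illustrative implementation, not the authors' code: the function name and the endpoint guard are our own, while the thresholds follow Eqs. (2) and (3) (Section IV-B later sets T_dist = 0.01 × max{H, W} and T_cos = 0.99).

```python
import math

def remove_redundant_points(points, t_dist, t_cos=0.99):
    """points: list of (x, y, stroke_id) tuples; returns a filtered list."""
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        x, y, s = points[i]
        xp, yp, sp = points[i - 1]
        xn, yn, sn = points[i + 1]
        # Stroke starting/ending points are always preserved.
        if not (s == sp == sn):
            kept.append(points[i])
            continue
        # Condition 1 (Eq. 2): too close to the previous point.
        if math.hypot(x - xp, y - yp) < t_dist:
            continue
        # Condition 2 (Eq. 3): nearly collinear with its neighbours.
        dx0, dy0 = x - xp, y - yp
        dx1, dy1 = xn - x, yn - y
        norm = math.hypot(dx0, dy0) * math.hypot(dx1, dy1)
        if norm > 0 and (dx0 * dx1 + dy0 * dy1) / norm > t_cos:
            continue
        kept.append(points[i])
    kept.append(points[-1])
    return kept
```

Applied to a perfectly straight stroke, for example, this keeps only the two endpoints, since every interior point satisfies the collinearity condition.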

B. Coordinate Normalization

Another influence comes from the variations in size or absolute values of the coordinates for characters captured with different devices or written by different people. Therefore, we must normalize the xy-coordinates into a standard interval. Specifically, as shown in Fig. 2(b), consider a straight line L connecting two points (x1, y1) and (x2, y2); the projections of this line onto the x-axis and y-axis are

$$p_x(L) = \int_L x \, dL = \frac{1}{2}\,len(L)(x_1 + x_2), \qquad p_y(L) = \int_L y \, dL = \frac{1}{2}\,len(L)(y_1 + y_2), \quad (4)$$

where $len(L) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$ denotes the length of L. With this information, we can estimate the mean values by projecting all lines onto the x-axis and y-axis:

$$\mu_x = \frac{\sum_{L \in \Omega} p_x(L)}{\sum_{L \in \Omega} len(L)}, \qquad \mu_y = \frac{\sum_{L \in \Omega} p_y(L)}{\sum_{L \in \Omega} len(L)}, \quad (5)$$

where Ω represents the set of all straight lines that connect two successive points within the same stroke. After this, we estimate the deviation (from the mean) of the projections:

$$d_x(L) = \int_L (x - \mu_x)^2 \, dL = \frac{1}{3}\,len(L)\left[(x_2 - \mu_x)^2 + (x_1 - \mu_x)^2 + (x_1 - \mu_x)(x_2 - \mu_x)\right]. \quad (6)$$

The standard deviation on the x-axis can then be estimated as:

$$\delta_x = \sqrt{\frac{\sum_{L \in \Omega} d_x(L)}{\sum_{L \in \Omega} len(L)}}. \quad (7)$$

With µx, µy and δx estimated from one character, we can now normalize the coordinates by:

$$x_{new} = (x - \mu_x)/\delta_x, \qquad y_{new} = (y - \mu_y)/\delta_x. \quad (8)$$

This normalization is applied globally to all the points in the character. Note that we do not estimate the standard deviation on the y-axis; the y-coordinate is also normalized by δx. The reason for doing so is to keep the original height-to-width ratio of the character, and also to keep the writing direction of each stroke. After coordinate normalization, each character is placed in a standard xy-coordinate system, while the shape of the character is kept unchanged.
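The whole normalization pipeline of Eqs. (4)-(8) can be sketched as follows; this is our own illustrative rendering (variable names assumed), operating on an (n, 3) array of [x, y, stroke_id] rows:

```python
import numpy as np

def normalize_coordinates(seq):
    """seq: (n, 3) array of [x, y, stroke_id]; returns a normalized copy."""
    seq = np.asarray(seq, dtype=float).copy()
    mu_num = np.zeros(2)   # numerators of Eq. (5)
    var_num = 0.0          # numerator of Eq. (7)
    total_len = 0.0
    for i in range(len(seq) - 1):
        if seq[i, 2] != seq[i + 1, 2]:
            continue  # Omega: only lines within the same stroke
        (x1, y1), (x2, y2) = seq[i, :2], seq[i + 1, :2]
        length = np.hypot(x2 - x1, y2 - y1)
        # Eq. (4): projections of the line onto the x- and y-axes.
        mu_num += 0.5 * length * np.array([x1 + x2, y1 + y2])
        total_len += length
    mu_x, mu_y = mu_num / total_len  # Eq. (5)
    for i in range(len(seq) - 1):
        if seq[i, 2] != seq[i + 1, 2]:
            continue
        (x1, y1), (x2, y2) = seq[i, :2], seq[i + 1, :2]
        length = np.hypot(x2 - x1, y2 - y1)
        # Eq. (6): deviation of the x-projection from the mean.
        var_num += length / 3.0 * ((x2 - mu_x) ** 2 + (x1 - mu_x) ** 2
                                   + (x1 - mu_x) * (x2 - mu_x))
    delta_x = np.sqrt(var_num / total_len)  # Eq. (7)
    # Eq. (8): both axes are scaled by delta_x to keep the aspect ratio.
    seq[:, 0] = (seq[:, 0] - mu_x) / delta_x
    seq[:, 1] = (seq[:, 1] - mu_y) / delta_x
    return seq
```

Note that dividing both coordinates by the same δx is what preserves the height-to-width ratio.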

C. Illustration

The characters before and after preprocessing are illustrated in Fig. 2(c) and (d), respectively. It is shown that the character shape is well preserved, and many redundant points have been removed. The original character contains 96 points while the processed character has only 44 points. This makes each point more informative, which benefits RNN modeling not just because of speed but also because the issue of long-term dependencies [28] is reduced, since sequences are shorter. Moreover, as shown in Fig. 2(d), the coordinates of the new character are normalized. In the new coordinate system, the position (0, 0) is located in the central part of the character, and the deviations on the xy-axes are also normalized. Since in this paper we use a sequence-based representation, the preprocessing used here is different from the traditional methods designed for image-based representations, such as equidistance sampling [3] and character shape normalization [29].

III. DISCRIMINATIVE MODEL: END-TO-END RECOGNITION WITH RECURRENT NEURAL NETWORK

The best established approaches for recognizing Chinese characters are based on transforming the sequence of Eq. (1) into some image-like representation [12], [13], [14] and then applying the convolutional neural network (CNN). For the purpose of fully end-to-end recognition, we apply a recurrent neural network (RNN) directly to the raw sequential data. Due to better utilization of temporal and spatial information, our RNN approach can achieve higher accuracy than previous CNN and image-based models.

A. Representation for Recognition

From the sequence of Eq. (1) (after preprocessing), we extract a six-dimensional representation for each straight line Li connecting two points i and i + 1:

$$L_i = [x_i,\; y_i,\; \Delta x_i,\; \Delta y_i,\; I(s_i = s_{i+1}),\; I(s_i \neq s_{i+1})], \quad (9)$$


where $\Delta x_i = x_{i+1} - x_i$, $\Delta y_i = y_{i+1} - y_i$, and $I(\cdot) = 1$ when the condition is true and 0 otherwise. In each Li, the first two terms are the start position of the line, the third and fourth terms are the direction of the pen movement, and the last two terms indicate the status of the pen, i.e., [1, 0] means pen-down while [0, 1] means pen-up. With this representation, the character in Eq. (1) is transformed into a new sequence [L1, L2, . . . , Ln−1]. To simplify the notation used in the following subsections, we will use [x1, x2, . . . , xk] to denote a general sequence, but note that each item xi here is actually the six-dimensional vector shown in Eq. (9).
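Building the features of Eq. (9) from a preprocessed sequence can be sketched as below; the function name is our own and the input is assumed to be an (n, 3) array of [x, y, stroke_id] rows:

```python
import numpy as np

def to_line_features(seq):
    """Map an (n, 3) point sequence to the (n-1, 6) features of Eq. (9)."""
    seq = np.asarray(seq, dtype=float)
    feats = []
    for i in range(len(seq) - 1):
        x, y, s = seq[i]
        xn, yn, sn = seq[i + 1]
        pen_down = float(s == sn)  # [1, 0] = pen-down, [0, 1] = pen-up
        feats.append([x, y, xn - x, yn - y, pen_down, 1.0 - pen_down])
    return np.array(feats)
```

A sequence of n points thus yields n − 1 six-dimensional items, one per straight line segment.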

B. Recurrent Neural Network (RNN)

The RNN is a natural generalization of feedforward neural networks to sequences [30]. Given a general input sequence [x1, x2, . . . , xk] where xi ∈ Rd (different samples may have different sequence lengths k), at each time-step of RNN modeling, a hidden state is produced, resulting in a hidden sequence [h1, h2, . . . , hk]. The activation of the hidden state at time-step t is computed as a function f of the current input xt and the previous hidden state ht−1:

ht = f(xt, ht−1). (10)

At each time-step, an optional output can be produced by yt = g(ht), resulting in an output sequence [y1, y2, . . . , yk], which can be used for sequence-to-sequence tasks, for example, based on the CTC framework [31]. In this section, the input sequence is encoded into a fixed-length vector for final classification, due to the recursively applied transition function f. The RNN computes activations for each time-step, which makes it extremely deep and can lead to vanishing or exploding gradients [28]. The choice of the recurrent computation f can have a big impact on the success of the RNN because the spectrum of its Jacobian controls whether gradients tend to propagate well (or vanish or explode). In this paper, we use both long short term memory (LSTM) [15], [16] and the gated recurrent unit (GRU) [17] for RNN modeling.

C. Long Short Term Memory (LSTM)

LSTM [15], [16] is widely applied because it reduces the vanishing and exploding gradient problems and can learn longer-term dependencies. With LSTMs, for time-step t, there is an input gate it, forget gate ft, and output gate ot:

$$i_t = \mathrm{sigm}(W_i x_t + U_i h_{t-1} + b_i), \quad (11)$$
$$f_t = \mathrm{sigm}(W_f x_t + U_f h_{t-1} + b_f), \quad (12)$$
$$o_t = \mathrm{sigm}(W_o x_t + U_o h_{t-1} + b_o), \quad (13)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \quad (14)$$
$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1}, \quad (15)$$
$$h_t = o_t \odot \tanh(c_t), \quad (16)$$

where $W_*$ is the input-to-hidden weight matrix, $U_*$ is the state-to-state recurrent weight matrix, and $b_*$ is the bias vector. The operation $\odot$ denotes the element-wise vector product. The hidden state of the LSTM is the concatenation of $(c_t, h_t)$. The long-term memory is saved in $c_t$; the forget gate and

[Fig. 3 structure: input sequence → stacked bidirectional LSTM/GRU layers → mean pooling and dropout → full layer and dropout → logistic regression over the 3,755 character classes.]

Fig. 3. The stacked bidirectional RNN for end-to-end recognition.

input gate are used to control the updating of $c_t$ as shown in Eq. (15), while the output gate is used to control the updating of $h_t$ as shown in Eq. (16).
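Eqs. (11)-(16) amount to one small computation per time-step; a minimal numpy sketch follows, with weight containers and names of our own choosing:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time-step. W/U/b: dicts keyed by gate ('i','f','o','c')."""
    i_t = sigm(W['i'] @ x_t + U['i'] @ h_prev + b['i'])         # Eq. (11)
    f_t = sigm(W['f'] @ x_t + U['f'] @ h_prev + b['f'])         # Eq. (12)
    o_t = sigm(W['o'] @ x_t + U['o'] @ h_prev + b['o'])         # Eq. (13)
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # Eq. (14)
    c_t = i_t * c_tilde + f_t * c_prev                          # Eq. (15)
    h_t = o_t * np.tanh(c_t)                                    # Eq. (16)
    return h_t, c_t
```

The `*` operations are element-wise, matching the ⊙ in Eqs. (15) and (16); a full forward pass simply loops this step over the sequence.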

D. Gated Recurrent Unit (GRU)

RNNs with gated recurrent units (GRUs) [17] can be viewed as a light-weight version of LSTMs. Similar to the LSTM unit, the GRU also has gating units (a reset gate rt and an update gate zt) that modulate the flow of information inside the unit, however, without a separate memory cell:

$$r_t = \mathrm{sigm}(W_r x_t + U_r h_{t-1} + b_r), \quad (17)$$
$$z_t = \mathrm{sigm}(W_z x_t + U_z h_{t-1} + b_z), \quad (18)$$
$$\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1}) + b), \quad (19)$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t. \quad (20)$$

The activation $h_t$ of the GRU is a linear interpolation between the previous activation $h_{t-1}$ and the candidate activation $\tilde{h}_t$, controlled by the update gate $z_t$. As shown in Eq. (19), when the reset gate $r_t$ is off (close to zero), the GRU acts as if it were reading the first symbol of an input sequence, allowing it to forget the previously computed state. It has been shown that GRUs and LSTMs have similar performance [32].
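For comparison with the LSTM, one GRU time-step of Eqs. (17)-(20) can be sketched the same way (again with our own illustrative names):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU time-step. W/U/b: dicts keyed by 'r', 'z', 'h'."""
    r_t = sigm(W['r'] @ x_t + U['r'] @ h_prev + b['r'])  # Eq. (17)
    z_t = sigm(W['z'] @ x_t + U['z'] @ h_prev + b['z'])  # Eq. (18)
    # Eq. (19): the reset gate modulates how much past state is read.
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (r_t * h_prev) + b['h'])
    return z_t * h_prev + (1.0 - z_t) * h_tilde          # Eq. (20)
```

Note the GRU returns a single state vector, whereas the LSTM carries the pair $(c_t, h_t)$; this is the "no separate memory cell" distinction in the text.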

E. Stacked and Bidirectional RNN

In real applications, contexts from both the past and the future are useful and complementary to each other [18]. Therefore, we combine forward (left to right) and backward (right to left) recurrent layers to build a bidirectional RNN model [33]. Moreover, stacked recurrent layers are used to build a deep RNN system. As shown in Fig. 3, by passing [x1, x2, . . . , xk] through the forward recurrent layers, we can obtain a hidden state sequence [h1, h2, . . . , hk]. Meanwhile, by passing the reversed sequence [xk, xk−1, . . . , x1] through the backward


Fig. 4. Illustration of data augmentation by sequential dropout on the input sequence. The first column shows the original character, and the remaining columns are the characters after random dropout with probability 0.3.

recurrent layers, we can get another hidden state sequence $[h'_1, h'_2, \ldots, h'_k]$. To make a final classification, all the hidden states are combined to obtain a fixed-length representation for the input sequence:

$$\text{Fixed Length Feature} = \frac{1}{2k}\sum_{i=1}^{k}(h_i + h'_i), \quad (21)$$

which is then fed into a fully connected layer and a softmax layer for final classification. The whole model can be efficiently and effectively trained by minimizing the multi-class negative log-likelihood loss with stochastic gradient descent, using the back-propagation algorithm [34] to compute gradients.
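The mean-pooling of Eq. (21) itself is a one-liner; a sketch (the function name is ours, and the hidden sequences are assumed to be stacked as (k, hidden) arrays):

```python
import numpy as np

def pooled_representation(h_forward, h_backward):
    """Eq. (21): average the forward and backward hidden states
    into one fixed-length vector, regardless of sequence length k."""
    k = len(h_forward)
    return (h_forward + h_backward).sum(axis=0) / (2.0 * k)
```

Because the sum is divided by 2k, sequences of different lengths all map to vectors of the same scale and dimension, which is what lets a single fully connected layer sit on top.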

F. Regularization and Data Augmentation

Regularization is important for improving the generalization performance of deep neural networks. As shown in Fig. 3, we apply the dropout [35] strategy on both the mean-pooling layer and the fully connected layer. Another key to the success of deep neural networks is a large amount of training data. Recently, it has been shown that dropout can also be viewed as a kind of data augmentation [36]. For traditional image-based recognition systems, random distortion is widely used as the data augmentation strategy [13], [14]. In this paper, we use a simple strategy to regularize and augment data for sequence classification, which we call sequential dropout. As shown in Fig. 4, given a sequence, many sub-sequences can be generated by randomly removing some items from the original sequence with a given probability. This could of course make more sense for some distributions, and it worked well for our data. For this to work, the preserved sub-sequence must still contain enough information for categorization, as shown in Fig. 4. This strategy is similar to the previously proposed dropStroke [37] and dropSegment [38] methods in handwriting analysis. However, our approach is much simpler and more general, not requiring any domain-specific knowledge (e.g., stroke/segment detection) in order to identify pieces to be dropped out.

With sequential dropout on the input, we can build a large enough (or infinite) training set, where each training sequence is only shown once. In the testing process, two strategies can be used. First, we can directly feed the full sequence into the RNN for classification. Second, we can also apply sequential dropout to obtain multiple sub-sequences, and then make an ensemble-based decision by fusing the classification results from these sub-sequences. The comparison of these two approaches will be discussed in the experimental section.
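Sequential dropout can be sketched in a few lines; the function name and the guard against returning an empty sequence are our own additions (the paper uses p = 0.3 on the input):

```python
import numpy as np

def sequential_dropout(seq, p=0.3, rng=None):
    """Randomly drop each item of seq independently with probability p."""
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(len(seq)) >= p
    if not keep.any():  # guard: never return an empty sequence
        keep[rng.integers(len(seq))] = True
    return [item for item, k in zip(seq, keep) if k]
```

Calling this on the same character with different random draws yields the distinct sub-sequences used for augmentation at training time and for ensemble voting at test time.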

G. Initialization and Optimization

Initialization is very important for deep neural networks. We initialize all the weight matrices in the LSTM/GRU ($W_*$ and $U_*$), the fully connected layer, and the logistic regression layer with random values drawn from a zero-mean Gaussian distribution with standard deviation 0.01. All bias terms are initialized as zeros, except the forget gate in the LSTM. As suggested by [39], we initialize the forget gate bias $b_f$ to a large value of 5. The purpose of doing so is to make sure that the forget gate in Eq. (12) is initialized close to one (which means no forgetting), so that long-range dependencies can be better learned at the beginning of training. The cell and hidden states of LSTMs and GRUs are initialized at zero. Optimization is another important issue for deep learning. In this paper, we use a recently proposed first-order gradient method called Adam [40], which is based on adaptive estimation of lower-order moments. These strategies helped make the training of RNNs both efficient and effective.
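The initialization scheme can be sketched for the LSTM parameters as follows; the dict layout mirrors the gate notation of Eqs. (11)-(16) and is an assumption of ours, while the Gaussian(0, 0.01) weights, zero biases, and forget-gate bias of 5 are as stated above:

```python
import numpy as np

def init_lstm_params(input_dim, hidden_dim, rng=None):
    """Gaussian(0, 0.01) weights, zero biases, forget-gate bias = 5."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = {g: rng.normal(0.0, 0.01, (hidden_dim, input_dim)) for g in 'ifoc'}
    U = {g: rng.normal(0.0, 0.01, (hidden_dim, hidden_dim)) for g in 'ifoc'}
    b = {g: np.zeros(hidden_dim) for g in 'ifoc'}
    # sigm(5) is about 0.993, so the forget gate starts near "remember all".
    b['f'] = np.full(hidden_dim, 5.0)
    return W, U, b
```

With near-zero weights, the forget gate activation at the first steps is essentially sigm(5) ≈ 0.993, so the cell state is carried through almost unchanged early in training.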

IV. EXPERIMENTS ON RECOGNIZING CHINESE CHARACTERS

In this section, we present experiments on recognizing cursive online handwritten Chinese characters, for the purpose of evaluating and comparing the proposed discriminative RNN model with other state-of-the-art approaches.

A. Database

The database used for evaluation is from the ICDAR-2013 competition [19] on online Chinese handwriting recognition, which is the third in the series of competitions held at CCPR-2010 [41] and ICDAR-2011 [42]. The database used for training is the CASIA database [43], including OLHWDB1.0 and OLHWDB1.1. In total there are 2,693,183 samples for training and 224,590 samples for testing. The training and test data were produced by different writers. The number of character classes is 3,755 (the level-1 set of GB2312-80). Online handwritten Chinese character recognition is a challenging problem [3] due to the large number of character classes, confusion between many similar characters, and distinct handwriting styles across individuals. Many teams from both academia and industry were involved in the three competitions, and the recognition accuracy has been promoted gradually and significantly through the competitions [41], [42], [19].

B. Implementation Details

In this paper, each character is represented by a sequence as shown in Eq. (1). The two hyper-parameters used in Section II for preprocessing are Tdist = 0.01 × max{H, W}


TABLE I
COMPARISON OF DIFFERENT NETWORK ARCHITECTURES FOR ONLINE HANDWRITTEN CHINESE CHARACTER RECOGNITION.

Name Architecture Recurrent Type Memory Train Time Test Speed Train Acc. Test Acc.

NET1 6 → [500] → 200 → 3755 LSTM 11.00MB 95.16h 0.3792ms 98.07% 97.67%

NET2 6 → [500] → 200 → 3755 GRU 9.06MB 75.92h 0.2949ms 97.81% 97.71%

NET3 6 → [100, 500] → 200 → 3755 LSTM 12.76MB 125.43h 0.5063ms 98.21% 97.70%

NET4 6 → [100, 500] → 200 → 3755 GRU 10.38MB 99.77h 0.3774ms 97.87% 97.76%

NET5 6 → [100, 300, 500] → 200 → 3755 LSTM 19.48MB 216.75h 0.7974ms 98.67% 97.80%

NET6 6 → [100, 300, 500] → 200 → 3755 GRU 15.43MB 168.80h 0.6137ms 97.97% 97.77%

and Tcos = 0.99, where H is the height and W is the width of the character. After preprocessing, the average length of the sequences for each character is about 50. As shown in Fig. 3, to increase the generalization performance, dropout is used for the mean pooling layer and the full layer with probability 0.1. Moreover, dropout (with probability 0.3) is also used on the input sequence for data augmentation as described in Section III-F. The initialization of the network is described in Section III-G. The optimization algorithm is Adam [40] with mini-batch size 1000. The initial learning rate is set to 0.001 and then multiplied by 0.3 whenever the cost or accuracy on the training data stops improving. After each epoch, we shuffle the training data to form different mini-batches. All the models were implemented on the Theano [44], [45] platform using an NVIDIA Titan-X 12G GPU.
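Section II (not reproduced here) uses Tdist and Tcos for trajectory simplification. A common form of such simplification can be sketched as follows; this is a hypothetical implementation, assuming a point is dropped when it lies closer than Tdist to the previously kept point, or when the trajectory through it is nearly straight (cosine of adjacent direction vectors above Tcos):

```python
import math

def simplify(points, t_dist, t_cos):
    """Drop near-duplicate and nearly-collinear points from one stroke.
    points: list of (x, y) tuples; t_dist, t_cos: thresholds as in the text."""
    # Pass 1: remove points too close to the previously kept point.
    kept = [points[0]]
    for p in points[1:]:
        q = kept[-1]
        if math.hypot(p[0] - q[0], p[1] - q[1]) < t_dist:
            continue
        kept.append(p)
    # Pass 2: remove points where the path is almost straight.
    out = [kept[0]]
    for i in range(1, len(kept) - 1):
        ax, ay = kept[i][0] - out[-1][0], kept[i][1] - out[-1][1]
        bx, by = kept[i + 1][0] - kept[i][0], kept[i + 1][1] - kept[i][1]
        na, nb = math.hypot(ax, ay), math.hypot(bx, by)
        if na and nb and (ax * bx + ay * by) / (na * nb) > t_cos:
            continue  # nearly collinear: drop the middle point
        out.append(kept[i])
    if len(kept) > 1:
        out.append(kept[-1])
    return out
```

With t_dist = 0.01 × max{H, W} and t_cos = 0.99 this keeps the sequence short while preserving the trajectory shape.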

C. Experimental Results

Table I shows the comparison of different network architectures, which can be represented in a general form as:

A → [B1, . . . , Bn] → C → D.   (22)

The symbol A is the dimension of each element in the input sequence. The [B1, . . . , Bn] represents n stacked bidirectional recurrent layers as shown in Fig. 3, where Bi is the dimension of the hidden states of the LSTM or GRU at the i-th recurrent layer. The symbol C is the number of hidden units in the full layer, and D is the number of units in the logistic regression layer (also the number of character classes). The 3rd column in Table I shows the recurrent unit type (LSTM or GRU) of each model. The networks are compared from five aspects: memory consumption in the 4th column, total training time (in hours) in the 5th column, evaluation/testing speed (in milliseconds per character) in the 6th column, training accuracy in the 7th column, and test accuracy in the last column. All other configurations are exactly the same as described in Section IV-B to give a fair comparison of the different architectures. The best performance (test accuracy) is achieved by NET5, while NET4 and NET6 are very competitive with NET5.

D. Comparison of LSTM and GRU

As shown in Sections III-C and III-D, both LSTM and GRU adopt the gating strategy to control the information flow, and

TABLE II
TEST ACCURACIES (%) OF ENSEMBLE-BASED DECISIONS FROM SUB-SEQUENCES GENERATED BY RANDOM DROPOUT. (COLUMNS 1–30 GIVE THE NUMBER OF SUB-SEQUENCES IN THE ENSEMBLE.)

Name Full 1 5 10 15 20 30

NET1 97.67 96.53 97.74 97.82 97.84 97.84 97.86

NET2 97.71 96.56 97.77 97.84 97.86 97.85 97.89

NET3 97.70 96.56 97.71 97.82 97.84 97.85 97.86

NET4 97.76 96.54 97.78 97.86 97.87 97.88 97.89

NET5 97.80 96.79 97.82 97.91 97.93 97.94 97.96

NET6 97.77 96.64 97.79 97.87 97.88 97.89 97.91

therefore allow the model to capture the long-term dependence embedded in the sequential data. In this paper, multiple RNN models with either LSTM or GRU recurrent units were trained and compared under the same configurations. As shown in Table I, from the perspective of test accuracy, NET2 outperforms NET1, NET4 beats NET3, and NET5 is better than NET6. However, the differences are not significant. Therefore, the only conclusion we can draw is that LSTM and GRU have comparable prediction accuracies for our classification task. Another finding is that LSTM usually leads to higher training accuracy but not necessarily higher test accuracy. This may suggest that GRU has some ability to avoid over-fitting. Furthermore, as revealed in Table I, from the perspectives of memory consumption, training time, and especially testing speed, we can conclude that GRU is much better than LSTM. GRU can be viewed as a light-weight version of LSTM that still shares similar functionality, which makes GRU favoured in practical applications with particular requirements on memory or speed.

E. Comparison of Different Depths

As described in Section III-E, stacked bidirectional recurrent layers are used to build the deep RNN systems. Different depths of the networks are also compared in Table I. Compared with only one bidirectional recurrent layer (NET1 and NET2), stacking two layers (NET3 and NET4) or three layers (NET5 and NET6) can indeed improve both the training and test accuracies. However, the improvements


TABLE III
RESULTS ON THE ICDAR-2013 COMPETITION DATABASE OF ONLINE HANDWRITTEN CHINESE CHARACTER RECOGNITION.

Methods: Ref. Memory Accuracy

Human Performance [19] n/a 95.19%

Traditional Benchmark [46] 120.0MB 95.31%

ICDAR-2011 Winner: VO-3 [42] 41.62MB 95.77%

ICDAR-2013 Winner: UWarwick [19] 37.80MB 97.39%

ICDAR-2013 Runner-up: VO-3 [19] 87.60MB 96.87%

DropSample-DCNN [14] 15.00MB 97.23%

DropSample-DCNN-Ensemble-9 [14] 135.0MB 97.51%

RNN: NET4 ours 10.38MB 97.76%

RNN: NET4-subseq30 ours 10.38MB 97.89%

RNN: Ensemble-NET123456 ours 78.11MB 98.15%

are not significant and vanish when more layers are stacked. This is because the recurrent units maintain activations for each time-step, which already makes the RNN model extremely deep; therefore, stacking more layers brings little additional discriminative ability to the model. Moreover, as shown in Table I, with more stacked recurrent layers, both the training and testing time increase dramatically. Therefore, we did not consider more than three stacked recurrent layers. From the perspectives of both accuracy and efficiency, in real applications, NET4 is preferred among the six network architectures.

F. Ensemble-based Decision from Sub-sequences

In our experiments, the dropout [35] strategy is applied not only as a regularization (Fig. 3) but also as a data augmentation method (Section III-F). As shown in Fig. 4, in the testing process, we can still apply dropout on the input sequence to generate multiple sub-sequences, and then make ensemble-based decisions to further improve the accuracy. Specifically, the probabilities (outputs from one network) of each sub-sequence are averaged to make the final prediction.

Table II reports the results for ensemble-based decisions from sub-sequences. With only one sub-sequence, the accuracy is inferior to that of the full sequence. This is easy to understand, since there is information loss in each sub-sequence. However, as more and more randomly sampled sub-sequences are included in the ensemble, the classification accuracies gradually improve. Finally, with the ensemble of 30 sub-sequences, the accuracies of the different networks become consistently higher than the full-sequence based prediction. These results verify the effectiveness of using dropout for ensemble-based sequence classification.
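The averaging step can be sketched as follows (illustrative only; `predict_proba` stands in for one forward pass of a trained network on a dropped sub-sequence):

```python
import random

def ensemble_predict(predict_proba, sequence, n_subseq=30, p_drop=0.3, seed=0):
    """Average class probabilities over randomly dropped sub-sequences.
    predict_proba: callable mapping a sequence to a list of class probabilities."""
    rng = random.Random(seed)
    total = None
    for _ in range(n_subseq):
        # Keep each time-step with probability 1 - p_drop (sequential dropout);
        # fall back to the full sequence if everything was dropped.
        sub = [x for x in sequence if rng.random() >= p_drop] or sequence
        probs = predict_proba(sub)
        total = probs if total is None else [a + b for a, b in zip(total, probs)]
    mean = [v / n_subseq for v in total]
    return max(range(len(mean)), key=mean.__getitem__)  # predicted class index
```

Only one trained model is needed; the diversity comes entirely from the random sub-sequences.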

G. Comparison with Other State-of-the-art Approaches

To compare our method with other approaches, Table III lists the state-of-the-art performance achieved by previous works on the ICDAR-2013 competition database [19]. The deep learning based approaches outperform the traditional methods by large margins. In the ICDAR-2011 competition, the winner was Vision Objects Ltd. from France, using a multilayer perceptron (MLP) classifier. In ICDAR-2013, the winner was from the University of Warwick, UK, using the path signature feature map and a deep convolutional neural network [13]. Recently, the state-of-the-art performance was achieved by [14] with domain-specific knowledge and an ensemble of nine convolutional neural networks.

As revealed in Table III and Table I, all of our models (NET1 to NET6) easily outperform the previous benchmarks. Taking NET4 as an example, compared with other approaches, it is better in terms of both memory consumption and classification accuracy. The previous best performance was usually achieved with convolutional neural networks (CNNs), which require transforming the online sequential handwriting data into image-like representations [12], [13], [14]. In contrast, our discriminative RNN model directly deals with the raw sequential data, and therefore has the potential to exploit additional information which is discarded in the spatial representations. Moreover, our method is fully end-to-end, depending only on generic priors about sequential data processing and not requiring any other domain-specific knowledge. These results suggest that, compared with CNNs, RNNs should be the first choice for online handwriting recognition, due to their powerful ability in sequence processing and the natural sequential property of online handwriting.

As shown in Table III, the ensemble of 30 sub-sequences with NET4 (NET4-subseq30), running on different random draws of sequential dropout (as discussed in Section III-F), can further improve the performance of NET4. An advantage of this ensemble is that only one trained model is required, which saves time and memory compared with the usual ensembles where multiple models must be trained. However, a drawback is that the number of randomly sampled sub-sequences must be large enough to guarantee the ensemble performance, which makes evaluation time-consuming. Another commonly used type of ensemble is obtained by model averaging over separately trained models. The classification performance of combining the six pre-trained networks (NET1 to NET6) is shown in the last row of Table III. Due to the differences in network depths and recurrent types, the six networks are complementary to each other. Finally, with this kind of ensemble, this paper reaches an accuracy of 98.15%, which is a new state-of-the-art and significantly outperforms all previously reported results for online handwritten Chinese character recognition.

V. GENERATIVE MODEL: AUTOMATIC DRAWING OF RECOGNIZABLE CHINESE CHARACTERS

Given an input sequence x and the corresponding character class y, the purpose of a discriminative model (as described in Section III) is to learn p(y|x). On the other hand, the purpose of a generative model is to learn p(x) or p(x|y) (a conditional generative model). In other words, by modeling the distribution of the sequences, the generative model can be used to draw (generate) new handwritten characters automatically.

Fig. 5. For the time-step t in the generative RNN model: (a) illustration of the training process, and (b) illustration of the drawing/generating process.

A. Representation for Generation

Compared with the representation used in Section III-A, the representation for generating characters is slightly different. Motivated by [4], each character can be represented as:

[[d1, s1], [d2, s2], . . . , [dk, sk]], (23)

where di = [Δxi, Δyi] ∈ R2 is the pen moving direction, which can be viewed as a straight line. We can draw the character with multiple lines by concatenating [d1, d2, . . . , dk], i.e., the ending position of the previous line is the starting position of the current line. Since one character usually contains multiple strokes, each line di may be either pen-down (should be drawn on the paper) or pen-up (should be ignored). Therefore, another variable si ∈ R3 is used to represent the status of the pen. As suggested by [27], three states should be considered:1

si =
  [1, 0, 0], pen-down,
  [0, 1, 0], pen-up,
  [0, 0, 1], end-of-char.   (24)

With the end-of-char value of si, the RNN can automatically decide when to finish the generating process. Using the representation in Eq. (23), the character can be drawn in vector format, which is more plausible and natural than a bitmap image.
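As an illustration, converting a pen trajectory (a list of strokes, each a list of absolute (x, y) points) into the [di, si] representation of Eq. (23) might look like this (a hypothetical helper, not the paper's code):

```python
def to_generation_format(strokes):
    """strokes: list of strokes; each stroke is a list of (x, y) points.
    Returns a list of ([dx, dy], state) pairs, with state being one of
    the one-hot pen-states of Eq. (24)."""
    PEN_DOWN, PEN_UP, END = [1, 0, 0], [0, 1, 0], [0, 0, 1]
    seq, prev = [], None
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            if prev is not None:
                dx, dy = x - prev[0], y - prev[1]
                # The move into the first point of a stroke is pen-up (the pen
                # travels without drawing); moves within a stroke are pen-down.
                state = PEN_UP if i == 0 else PEN_DOWN
                seq.append(([dx, dy], state))
            prev = (x, y)
    if seq:
        seq[-1] = (seq[-1][0], END)  # mark the last line as end-of-char
    return seq
```

Concatenating the direction vectors reproduces the trajectory, while the states say which lines are actually inked.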

B. Conditional Generative RNN Model

To model the distribution of the handwriting sequence, a generative RNN model is utilized. Considering that there are a large number of different Chinese characters, and in order to generate realistic and readable characters, a character embedding is trained jointly with the RNN model. The character embedding is a matrix E ∈ Rd×N, where N is the number of character classes and d is the embedding dimensionality. Each column in E is the embedded vector for a particular class.

1Note that the symbol si used here has a different meaning from the si used in Eq. (1); the si used here can be easily deduced from Eq. (1).

In the following descriptions, we use c ∈ Rd to represent the embedding vector for a general character class.

Our previous experiments show that GRUs and LSTMs have comparable performance, but the computation of GRU is more efficient. Therefore, we build our generative RNN model based on GRUs [17] rather than LSTMs. As shown in Fig. 5, at time-step t, the inputs for a GRU include:
• previous hidden state ht−1 ∈ RD,
• current pen-direction dt ∈ R2,
• current pen-state st ∈ R3,
• character embedding c ∈ Rd.

Following the GRU gating strategy, the updating of the hidden state and the computation of the output for time-step t are:

d′t = tanh(Wd dt + bd),   (25)
s′t = tanh(Ws st + bs),   (26)
rt = sigm(Wr ht−1 + Ur d′t + Vr s′t + Mr c + br),   (27)
zt = sigm(Wz ht−1 + Uz d′t + Vz s′t + Mz c + bz),   (28)
h̃t = tanh(W (rt ⊙ ht−1) + U d′t + V s′t + M c + b),   (29)
ht = zt ⊙ ht−1 + (1 − zt) ⊙ h̃t,   (30)
ot = tanh(Wo ht + Uo d′t + Vo s′t + Mo c + bo),   (31)

where W∗, U∗, V∗, M∗ are weight matrices and b∗ are bias vectors of the GRU. Since both the pen-direction dt and pen-state st are low-dimensional, we first transform them into higher-dimensional spaces by Eqs. (25) and (26). After that, the reset gate rt is computed in Eq. (27) and the update gate zt in Eq. (28). The candidate hidden state in Eq. (29) is controlled by the reset gate rt, which can automatically decide whether to forget the previous state or not. The new hidden state ht in Eq. (30) is then updated as a combination of the previous and candidate hidden states, controlled by the update gate zt. At last, an output vector ot is calculated in Eq. (31). To improve the generalization performance, the dropout strategy [35] is also applied on ot.
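One step of this conditional GRU, Eqs. (25)–(31), can be sketched in NumPy as follows (a minimal sketch; the parameter dictionary holds randomly initialized matrices here and all shapes are illustrative):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(P, h_prev, d_t, s_t, c):
    """One step of the conditional GRU of Eqs. (25)-(31)."""
    dp = np.tanh(P["Wd"] @ d_t + P["bd"])                                  # Eq. (25)
    sp = np.tanh(P["Ws"] @ s_t + P["bs"])                                  # Eq. (26)
    r = sigm(P["Wr"] @ h_prev + P["Ur"] @ dp + P["Vr"] @ sp + P["Mr"] @ c + P["br"])  # (27)
    z = sigm(P["Wz"] @ h_prev + P["Uz"] @ dp + P["Vz"] @ sp + P["Mz"] @ c + P["bz"])  # (28)
    h_cand = np.tanh(P["W"] @ (r * h_prev) + P["U"] @ dp + P["V"] @ sp + P["M"] @ c + P["b"])  # (29)
    h = z * h_prev + (1.0 - z) * h_cand                                    # Eq. (30)
    o = np.tanh(P["Wo"] @ h + P["Uo"] @ dp + P["Vo"] @ sp + P["Mo"] @ c + P["bo"])    # (31)
    return h, o

def make_params(D=8, E=5, emb=4, seed=0):
    """Random parameters with the shapes implied by the equations
    (D: hidden size, E: transformed input size, emb: embedding size)."""
    rng = np.random.default_rng(seed)
    P = {}
    for g in ("r", "z", "", "o"):  # "" holds W, U, V, M, b of Eq. (29)
        P["W" + g] = rng.normal(0, 0.01, (D, D))
        P["U" + g] = rng.normal(0, 0.01, (D, E))
        P["V" + g] = rng.normal(0, 0.01, (D, E))
        P["M" + g] = rng.normal(0, 0.01, (D, emb))
        P["b" + g] = np.zeros(D)
    P["Wd"], P["bd"] = rng.normal(0, 0.01, (E, 2)), np.zeros(E)
    P["Ws"], P["bs"] = rng.normal(0, 0.01, (E, 3)), np.zeros(E)
    return P
```

The embedding vector c enters every gate, so the class identity influences the reset, update, candidate state, and output at every time-step.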

In all these computations, the character embedding c is provided to remind the RNN that this is the drawing of a particular character rather than random scrawling. The dynamic writing information for this character is encoded in the hidden state

Page 9: Drawing and Recognizing Chinese Characters with Recurrent ... · Drawing and Recognizing Chinese Characters with Recurrent Neural ... generative model with character embedding is

9

of the RNN, which is automatically updated and controlled by the GRU (remember or forget) due to the gating strategies. At each time-step, an output ot is produced based on the new hidden state, as shown in Eq. (31). From this output, the next pen-direction and pen-state should be inferred to continue the task of automatic drawing.

C. GMM Modeling of Pen-Direction: From ot To dt+1

As suggested by [4], a Gaussian mixture model (GMM) is used for the pen-direction. Suppose there are M components in the GMM; a 5 × M-dimensional vector is then calculated based on the output vector ot as:

(π̂j, μ̂jx, μ̂jy, δ̂jx, δ̂jy)Mj=1 ∈ R5M = Wgmm × ot + bgmm,   (32)
πj = exp(π̂j) / Σj′ exp(π̂j′) ⇒ πj ∈ (0, 1), Σj πj = 1,   (33)
μjx = μ̂jx ⇒ μjx ∈ R,   (34)
μjy = μ̂jy ⇒ μjy ∈ R,   (35)
δjx = exp(δ̂jx) ⇒ δjx > 0,   (36)
δjy = exp(δ̂jy) ⇒ δjy > 0.   (37)

Note that the above five variables are for time-step t + 1; here we omit the subscript t + 1 for simplicity. For the j-th component of the GMM, πj denotes the component weight, μjx and μjy denote the means, while δjx and δjy are the standard deviations. The probability density Pd(dt+1) for the next pen-direction dt+1 = [Δxt+1, Δyt+1] is defined as:

Pd(dt+1) = ΣMj=1 πj N(dt+1 | μjx, μjy, δjx, δjy)
         = ΣMj=1 πj N(Δxt+1 | μjx, δjx) N(Δyt+1 | μjy, δjy),   (38)

where

N(x | μ, δ) = (1 / (δ√(2π))) exp(−(x − μ)2 / (2δ2)).   (39)

Different from [4], here for each mixture component the x-axis and y-axis are assumed to be independent, which simplifies the model and still gives similar performance compared with the full bivariate Gaussian model. Using a GMM for modeling the pen-direction can capture the dynamic information of different handwriting styles, and hence allows the RNN to generate diverse handwritten characters.
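The factorized density of Eqs. (38)–(39) can be written down directly (a NumPy sketch; `pi`, `mu_x`, `mu_y`, `delta_x`, `delta_y` are the length-M arrays computed from ot as in Eqs. (32)–(37)):

```python
import numpy as np

def gauss(x, mu, delta):
    """1-D Gaussian density, Eq. (39)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * delta ** 2)) / (delta * np.sqrt(2.0 * np.pi))

def pen_direction_density(dx, dy, pi, mu_x, mu_y, delta_x, delta_y):
    """Eq. (38): mixture of axis-independent bivariate Gaussians,
    evaluated at the pen-direction (dx, dy)."""
    return float(np.sum(pi * gauss(dx, mu_x, delta_x) * gauss(dy, mu_y, delta_y)))
```

Because the axes are independent, each component's bivariate density is just the product of two 1-D Gaussians.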

D. SoftMax Modeling of Pen-State: From ot To st+1

To model the discrete pen-states (pen-down, pen-up, or end-of-char), the softmax activation is applied on a transformation of ot to give a probability for each state:

(p̂1t+1, p̂2t+1, p̂3t+1) ∈ R3 = Wsoftmax × ot + bsoftmax,   (40)
pit+1 = exp(p̂it+1) / Σ3j=1 exp(p̂jt+1) ∈ (0, 1) ⇒ Σ3i=1 pit+1 = 1.   (41)

Fig. 6. Illustration of the generating/drawing of one particular character at different epochs (1, 2, 3, 5, 10, 20, 30, 40, 50) of the training process.

The probability density Ps(st+1) for the next pen-state st+1 = [s1t+1, s2t+1, s3t+1] is then defined as:

Ps(st+1) = Σ3i=1 sit+1 pit+1.   (42)

With this softmax modeling, the RNN can automatically decide the status of the pen and also the ending time of the generating process, according to the dynamic changes in the hidden state of the GRU during the drawing/writing process.

E. Training of the Generative RNN Model

To train the generative RNN model, a loss function should be defined. Given a character represented by a sequence as in Eq. (23) and its corresponding character embedding c ∈ Rd, by passing them through the RNN model, as shown in Fig. 5(a), the final loss can be defined as the summation of the losses at each time-step:

loss = −Σt [ log(Pd(dt+1)) + log(Ps(st+1)) ].   (43)

However, as shown by [27], directly minimizing this loss function leads to poor performance, because the three pen-states in Eq. (24) do not occur equally often in the training process. The pen-down state occurs so frequently that it always dominates the loss, especially compared with the end-of-char state, which occurs only once per character. To reduce the influence of this imbalance, a cost-sensitive approach [27] should be used to define a new loss:

loss = −Σt [ log(Pd(dt+1)) + Σ3i=1 wi sit+1 log(pit+1) ],   (44)

where [w1, w2, w3] = [1, 5, 100] are the weights for the losses of pen-down, pen-up, and end-of-char, respectively. In this way, the RNN can be trained effectively to produce realistic characters. Other strategies such as initialization and optimization are the same as in Section III-G.
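The cost-sensitive pen-state term of Eq. (44) can be sketched as follows (illustrative; `targets` and `probs` stand for the one-hot pen-states and the network's softmax outputs over one sequence):

```python
import math

PEN_STATE_WEIGHTS = [1.0, 5.0, 100.0]  # pen-down, pen-up, end-of-char

def pen_state_loss(targets, probs):
    """Weighted cross-entropy over the sequence, the second term of Eq. (44).
    targets: list of one-hot pen-states s_{t+1}; probs: list of predicted
    distributions (p^1, p^2, p^3) from Eq. (41)."""
    loss = 0.0
    for s, p in zip(targets, probs):
        for i in range(3):
            if s[i]:  # only the true state contributes, scaled by its weight
                loss -= PEN_STATE_WEIGHTS[i] * s[i] * math.log(p[i])
    return loss
```

The weight 100 on end-of-char makes the single termination step matter as much as many ordinary pen-down steps, counteracting the class imbalance.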


F. Automatic Drawing of Recognizable Characters

After training, the model can be used for automatic drawing of handwritten characters. Since this is a conditional generative model, we first select which character to draw by choosing a column (denoted as c ∈ Rd) from the character embedding matrix E ∈ Rd×N; this vector is then used at each time-step of generating (see Fig. 5(b)). The initial hidden state, pen-direction, and pen-state are all set to zeros. After that, as shown in Fig. 5(b), at time-step t, a pen-direction dt+1 is randomly sampled from Pd in Eq. (38). Since this is a GMM, the sampling can be efficiently implemented by first randomly choosing a component and then sampling from the corresponding Gaussian distribution. The pen-state st+1 is then inferred from Eq. (41) with hard-max, i.e., setting the largest element to one and the remaining elements to zero.

As shown in Fig. 5(b), if the pen-state changes to [0, 0, 1] (end-of-char), the generating process is finished. Otherwise, we continue the drawing process by feeding [ht, dt+1, st+1, c] into the GRU to generate dt+2 and st+2. By repeating this process and drawing the generated lines on the paper according to the pen-states (down or up), we obtain an automatically generated character, which should be cursive and human-readable.
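One sampling step of this drawing loop might look like the following (illustrative sketch; the GMM parameters and pen-state probabilities would come from the trained network's output ot):

```python
import numpy as np

def sample_step(pi, mu_x, mu_y, delta_x, delta_y, p_state, rng=None):
    """Sample the next pen-direction from the GMM of Eq. (38) and pick the
    pen-state by hard-max over the softmax probabilities of Eq. (41)."""
    rng = rng or np.random.default_rng()
    j = rng.choice(len(pi), p=pi)            # pick a mixture component
    dx = rng.normal(mu_x[j], delta_x[j])     # x and y sampled independently
    dy = rng.normal(mu_y[j], delta_y[j])
    state = np.zeros(3)
    state[np.argmax(p_state)] = 1.0          # hard-max pen-state
    done = bool(state[2])                    # [0, 0, 1] means end-of-char
    return (dx, dy), state, done
```

The loop terminates as soon as `done` is true; otherwise the sampled direction and state are fed back into the GRU for the next step.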

VI. EXPERIMENTS ON DRAWING CHINESE CHARACTERS

In this section, we show the generated characters visually, and analyze the quality of the generated characters by feeding them into the discriminative RNN model to check whether they are recognizable or not. Moreover, we also discuss properties of the character embedding matrix.

A. Database

To train the generative RNN model, we still use the CASIA database [43], including OLHWDB1.0 and OLHWDB1.1. There are more than two million training samples, and all the characters are written cursively with the frequently-used handwriting habits of different individuals. This is significantly different from the experiment in [27], where only 11,000 regularly-written samples are used for training. Each character is now represented by multiple lines as shown in Eq. (23). The two hyper-parameters used in Section II for removing redundant information are Tdist = 0.05 × max{H, W} and Tcos = 0.9, where H is the height and W is the width of the character. After preprocessing, the average length of each sequence (character) is about 27. Note that the sequences used here are shorter than the sequences used for classification in Section IV. The reason for doing so is to make each line in Eq. (23) more informative and thus alleviate the influence of noise strokes in the generating process.

B. Implementation Details

Our generative RNN model is capable of drawing 3,755 different characters. The dimension of the character embedding (as described in Section V-B) is 500. In Eqs. (25) and (26), both the low-dimensional pen-direction and pen-state are transformed to a 300-dimensional space. The dimension of


Fig. 7. The character embedding matrix and the nearest neighbors (of some representative characters) calculated from the embedding matrix.

the hidden state of the GRU is 1000; therefore, the dimensions of the vectors in Eqs. (27)–(30) are all 1000. The dimension of the output vector in Eq. (31) is 300, and the dropout probability applied on this output vector is 0.3. The number of mixture components in the GMM of Section V-C is 30. With these configurations, the generative RNN model is trained using Adam [40] with mini-batch size 500 and initial learning rate 0.001. With Theano [44], [45] and an NVIDIA Titan-X 12G GPU, the training of our generative RNN model took about 50 hours to converge.

C. Illustration of the Training Process

To monitor the training process, Fig. 6 shows the generated characters (for the first character among the 3,755 classes) in each epoch. In the very beginning, the model seems to be confused by the many character classes. In the first three epochs, the generated characters look like random mixtures (combinations) of different characters, which are impossible to read. Not until the 10th epoch can some initial structures be found for this particular character. After that, as training continues, the generated characters become more and more clear. In the 50th epoch, all the generated characters can be easily recognized by a human with high confidence. Moreover, all the generated characters are cursive, and different handwriting styles can be found among them. This verifies the effectiveness of the training process for the generative RNN model. Another finding in the experiments is that the Adam [40] optimization algorithm works much better for our generative RNN model than the traditional stochastic gradient descent (SGD) with momentum. With the Adam algorithm, our model converged within about 60 epochs.


Fig. 8. Illustration of the automatically generated characters for different classes. Each row represents a particular character class. To give a better illustration, each color (randomly selected) denotes one straight line (pen-down) as shown in Eq. (23).

D. Property of Character Embedding Matrix

As shown in Section V-B, the generative RNN model is jointly trained with the character embedding matrix E ∈ Rd×N, which allows the model to generate characters according to the class indexes. We show the character embedding matrix (500 × 3755) in Fig. 7. Each column in this matrix is the embedded vector for a particular class. The goal of the embedding is to indicate to the RNN the identity of the character to be generated. Therefore, characters with similar writing trajectories (or similar shapes) are supposed to be close to each other in the embedded space. To verify this, we calculate the nearest neighbors of a character category according to the Euclidean distance in the embedded space. As shown in Fig. 7, the nearest neighbors of one character usually have a similar shape or share similar sub-structures with this character. Note that in the training process we did not utilize any between-class information; the objective of the model is just to maximize the generating probability conditioned on the embedding. The character relationship is automatically learned from the handwriting similarity of the characters. These results verify the effectiveness of the joint training of the character embedding and the generative RNN model, which together form a model of the conditional distribution p(x|y), where x is the handwriting trajectory and y is the character category.
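The nearest-neighbor computation used here is plain Euclidean distance between columns of E, which can be sketched as:

```python
import numpy as np

def nearest_neighbors(E, class_idx, k=10):
    """Return the indices of the k classes whose embedding columns are
    closest (in Euclidean distance) to that of class_idx.  E: shape (d, N)."""
    diffs = E - E[:, [class_idx]]              # broadcast against one column
    dists = np.sqrt((diffs ** 2).sum(axis=0))  # distance to every column
    order = np.argsort(dists)
    return [int(i) for i in order if i != class_idx][:k]
```

Applied to the trained 500 × 3755 matrix, this recovers the shape-similar neighbors shown in Fig. 7.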

E. Illustration of Automatically Generated Characters

With the character embedding, our RNN model can draw 3,755 different characters, by first choosing a column from the embedding matrix and then feeding it into every step of the generating process as shown in Fig. 5. To verify the ability of drawing different characters, Fig. 8 shows the automatically generated characters for nine different classes. All the generated characters are new and different from the training data. The generating/drawing is implemented randomly step-by-step, i.e., by randomly sampling the pen-direction from the GMM as described in Section V-C, and updating the hidden states of the GRU according to the previously sampled handwriting trajectory as shown in Section V-B. Moreover, all the characters are automatically ended with the end-of-char pen-state as discussed in Section V-D, which means the RNN can automatically decide when and how to finish the writing/drawing process.

All the automatically generated characters are human-readable, and they are hard to distinguish from real handwritten characters produced by human beings. The memory size of our generative RNN model is only 33.79MB, yet it can draw as many as 3,755 different characters. This means we have successfully transformed a large handwriting database (with more than two million samples) into a small RNN generator, from which we can sample infinitely many different characters. With


Fig. 9. (a): The classification accuracies of the automatically generated characters for the 3,755 classes (mean accuracy: 0.9398). (b): The generated characters which have low recognition rates. (c): The generated characters which have perfect (100%) recognition rates.

the GMM modeling of pen-direction, different handwriting habits can be covered in the writing process. As shown in Fig. 8, in each row (character), there are multiple handwriting styles, e.g., regular, fluent, and cursive. These results verify not only the ability of the model in drawing recognizable Chinese characters but also the diversity of the generative model in handling different handwriting styles.

Nevertheless, we note that the generated characters are not 100% perfect. As shown in the last few rows of Fig. 8, there are some missing strokes in the generated characters which make them hard to read. Therefore, we need methods to estimate the quality of the generated characters in a quantitative manner.

F. Quality Analysis: Recognizable or Not?

To further analyze the quality of the generated characters, the discriminative RNN model of Section III is used to check whether the generated characters are recognizable or not. The architecture of NET4 in Table I is utilized due to its good performance in the recognition task. Both the discriminative and generative RNN models are trained with real handwritten characters from [43]. After that, for each of the 3,755 classes, we randomly generate 100 characters with the generative RNN model, resulting in 375,500 test samples, which are then fed into the discriminative RNN model for evaluation. The classification accuracies of the generated characters with respect to the different classes are shown in Fig. 9(a).

It is shown that for most classes, the generated characters can be automatically recognized with very high accuracy. This verifies the ability of our generative model to correctly write thousands of different Chinese characters. In the previous work of [27], an LSTM-RNN is used to generate fake, regularly-written Chinese characters. In contrast, in this paper, our generative model is conditioned on the character embedding, and a large real handwriting database containing different handwriting styles is used for training. Therefore, the automatically generated characters in this paper are not only cursive but also readable by both human and machine.

The average classification accuracy over all the generated characters is 93.98%, which means most of the characters are correctly classified. However, compared with the classification accuracies for real characters shown in Table I, the recognition accuracy for the generated characters is still lower. As shown in Fig. 9(a), there are some particular classes with significantly low accuracies, i.e., below 50%. To see what is happening there, we show some generated characters with low recognition rates in Fig. 9(b). The wrongly classified characters usually come from confusable classes, i.e., character classes that differ only subtly in shape from another character class. In such cases, the generative RNN model was not capable of capturing these small but important details for accurate drawing of the particular character.

In contrast, as shown in Fig. 9(c), for the character classes which do not have any confusion with other classes, the generated characters can be easily classified with 100% accuracy. Therefore, to further improve the quality of the generated characters, we should pay more attention to the similar/confusing character classes. One solution is to modify the loss function in order to emphasize the training on the confusing class pairs, as suggested by [47]. Another strategy is to integrate the attention mechanism [48], [49] and the memory mechanism [50], [51] with the generative RNN model, allowing the model to dynamically memorize and focus on the critical region of a particular character during the writing process. These are future directions for further improving the quality of the generative RNN model.

VII. CONCLUSION AND FUTURE WORK

This paper investigates two closely-coupled tasks: automatically reading and writing. Specifically, the recurrent neural network (RNN) is used as both a discriminative and a generative model for recognizing and drawing cursive handwritten



Chinese characters. In the discriminative model, a deep stacked bidirectional RNN is built with both LSTM and GRU units for recognition. Compared with previous convolutional neural network (CNN) based approaches, which require some image-like representation, our method is fully end-to-end, directly dealing with the raw sequential data. Due to the straightforward utilization of the spatial and temporal information, our discriminative RNN model achieves new state-of-the-art performance on the ICDAR-2013 competition database. High character recognition accuracy is essential for text recognition [52], [53]; hence, the discriminative RNN model can hopefully be combined with the CTC [31], [18] for segmentation-free handwritten Chinese text recognition [54]. Moreover, another potential direction is combining the powerful image processing ability of CNNs with the sequence processing ability of RNNs for further accuracy improvements on character recognition.
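To make the bidirectional structure concrete, the following is a toy sketch of one bidirectional recurrent layer: scalar hidden states, untrained weights, and plain tanh units standing in for the LSTM/GRU cells of the actual model. All names and weight values are illustrative assumptions.

```python
import math

def rnn_pass(seq, w_x, w_h, reverse=False):
    """One pass of a plain tanh recurrent layer over a scalar sequence,
    returning the hidden state at every time step."""
    xs = seq[::-1] if reverse else seq
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        out.append(h)
    # re-align backward outputs with the original time order
    return out[::-1] if reverse else out

def bidirectional_features(seq, w_fwd=(0.5, 0.3), w_bwd=(0.4, 0.2)):
    """Pair the forward and backward states at each step, as in one
    bidirectional layer; a deep stack feeds these pairs to the next layer."""
    fwd = rnn_pass(seq, *w_fwd)
    bwd = rnn_pass(seq, *w_bwd, reverse=True)
    return list(zip(fwd, bwd))
```

Because each time step sees both a forward state (summarizing the past strokes) and a backward state (summarizing the future ones), the classifier on top has access to the whole trajectory, which is what distinguishes this design from a unidirectional RNN.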

Besides recognition, this paper also considers the automatic drawing of real and cursive Chinese characters. A conditional generative RNN model is jointly trained with the character embedding, which allows the model to correctly write thousands of different characters. A Gaussian mixture model (GMM) is used for modeling the pen direction, which guarantees the diversity of the model in generating different handwriting styles. The generative RNN model can automatically decide when and how to finish the drawing process by modeling three discrete pen states. It is shown that the generated characters are not only human-readable but also recognizable by the discriminative RNN model with high accuracy. Beyond drawing characters, an interesting future direction is to utilize the proposed method as a building block for the synthesis of cursive handwritten Chinese texts. Moreover, in this paper, the generative model is conditioned on the character embedding. Another important future extension is to condition the generative RNN model on a static image (combined with convolution) and then automatically recover (or generate) the dynamic handwriting trajectory (order) from the static image [55], [56], which is a hard problem with great value in practical applications.
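The sampling step of such a GMM output layer can be sketched as follows. This is a minimal illustration with hypothetical parameter names, not the paper's implementation: it draws one (dx, dy) pen offset from a diagonal-covariance mixture and picks one of the three discrete pen states.

```python
import random

def sample_pen_step(pis, mus, sigmas, state_probs, rng=random):
    """Sample one pen move from a GMM over (dx, dy) plus a categorical
    distribution over the three pen states (down / up / end-of-character).
    pis, mus, sigmas are the mixture weights, means and std-devs emitted
    by the RNN at one time step (names are illustrative)."""
    # pick a mixture component in proportion to its weight
    r, acc, k = rng.random(), 0.0, 0
    for k, pi in enumerate(pis):
        acc += pi
        if r < acc:
            break
    dx = rng.gauss(mus[k][0], sigmas[k][0])
    dy = rng.gauss(mus[k][1], sigmas[k][1])
    # pick the pen state (argmax shown; sampling is equally valid)
    pen_state = max(range(len(state_probs)), key=lambda i: state_probs[i])
    return (dx, dy, pen_state)
```

Sampling from the mixture, rather than taking its mean, is what lets repeated generations of the same character produce different handwriting styles; the end-of-character pen state is how the model decides when the drawing is finished.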

The relationship between the discriminative and generative models is also an important future research topic. Since the generative model is capable of producing realistic handwritten characters (with labels), a straightforward attempt is to use it as a data augmentation strategy for the supervised training of the discriminative model. In the opposite direction, the discriminative model can be used as a regularizer [57] to improve the quality of the generative model. Moreover, the generative model can also cooperate with the discriminative model in an adversarial manner [23], [58]. Taking all these together, an attractive and important future work is the simultaneous training of the discriminative and generative models in a unified multi-task framework.

ACKNOWLEDGMENTS

The authors thank the developers of Theano [44], [45] for providing such a good deep learning platform.

REFERENCES

[1] C. Suen, M. Berthod, and S. Mori, “Automatic recognition of handprinted characters: the state of the art,” Proceedings of the IEEE, vol. 68, no. 4, pp. 469–487, 1980.

[2] R. Plamondon and S. Srihari, “Online and offline handwriting recognition: a comprehensive survey,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63–84, 2000.

[3] C.-L. Liu, S. Jaeger, and M. Nakagawa, “Online recognition of Chinese characters: The state-of-the-art,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 198–213, 2004.

[4] A. Graves, “Generating sequences with recurrent neural networks,” arXiv:1308.0850, 2013.

[5] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

[6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[8] D. Ciresan and J. Schmidhuber, “Multi-column deep neural networks for offline handwritten Chinese character classification,” arXiv:1309.0261, 2013.

[9] C. Wu, W. Fan, Y. He, J. Sun, and S. Naoi, “Handwritten character recognition by alternately trained relaxation convolutional neural network,” Proc. Int’l Conf. Frontiers in Handwriting Recognition (ICFHR), pp. 291–296, 2014.

[10] Z. Zhong, L. Jin, and Z. Xie, “High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps,” Proc. Int’l Conf. Document Analysis and Recognition (ICDAR), 2015.

[11] L. Chen, S. Wang, W. Fan, J. Sun, and S. Naoi, “Beyond human recognition: A CNN-based framework for handwritten character recognition,” Proc. Asian Conf. Pattern Recognition (ACPR), 2015.

[12] Y. Bengio, Y. LeCun, and D. Henderson, “Globally trained handwritten word recognizer using spatial representation, space displacement neural networks and hidden Markov models,” Proc. Advances in Neural Information Processing Systems (NIPS), pp. 937–944, 1994.

[13] B. Graham, “Sparse arrays of signatures for online character recognition,” arXiv:1308.0371, 2013.

[14] W. Yang, L. Jin, D. Tao, Z. Xie, and Z. Feng, “DropSample: A new training method to enhance deep convolutional neural networks for large-scale unconstrained handwritten Chinese character recognition,” arXiv:1505.05354, 2015.

[15] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[16] F. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.

[17] K. Cho, B. Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), 2014.

[18] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, 2009.

[19] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, “ICDAR 2013 Chinese handwriting recognition competition,” Proc. Int’l Conf. Document Analysis and Recognition (ICDAR), 2013.

[20] H. Larochelle and I. Murray, “The neural autoregressive distribution estimator,” Proc. Int’l Conf. Artificial Intelligence and Statistics (AISTATS), 2011.

[21] D. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv:1312.6114, 2013.

[22] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra, “DRAW: A recurrent neural network for image generation,” Proc. Int’l Conf. Machine Learning (ICML), 2015.

[23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Proc. Advances in Neural Information Processing Systems (NIPS), 2014.

[24] E. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep generative image models using a Laplacian pyramid of adversarial networks,” arXiv:1506.05751, 2015.

[25] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv:1511.06434, 2015.



[26] “Generating offline Chinese characters with DCGAN,” 2015. [Online]. Available: http://www.genekogan.com/works/a-book-from-the-sky.html

[27] “Generating online fake Chinese characters with LSTM-RNN,” 2015. [Online]. Available: http://blog.otoro.net/2015/12/28/recurrent-net-dreams-up-fake-chinese-characters-in-vector-format-with-tensorflow

[28] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.

[29] C.-L. Liu and K. Marukawa, “Pseudo two-dimensional shape normalization methods for handwritten Chinese character recognition,” Pattern Recognition, vol. 38, no. 12, pp. 2242–2255, 2005.

[30] I. Goodfellow, A. Courville, and Y. Bengio, “Deep learning,” book in press, MIT Press, 2016.

[31] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” Proc. Int’l Conf. Machine Learning (ICML), 2006.

[32] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” Proc. Advances in Neural Information Processing Systems (NIPS), 2014.

[33] M. Schuster and K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[34] D. Rumelhart, G. Hinton, and R. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.

[35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[36] X. Bouthillier, K. Konda, P. Vincent, and R. Memisevic, “Dropout as data augmentation,” arXiv:1506.08700, 2015.

[37] W. Yang, L. Jin, and M. Liu, “Character-level Chinese writer identification using path signature feature, dropstroke and deep CNN,” Proc. Int’l Conf. Document Analysis and Recognition (ICDAR), 2015.

[38] ——, “DeepWriterID: An end-to-end online text-independent writer identification system,” arXiv:1508.04945, 2015.

[39] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” Proc. Int’l Conf. Machine Learning (ICML), 2015.

[40] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Proc. Int’l Conf. Learning Representations (ICLR), 2015.

[41] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, “Chinese handwriting recognition contest 2010,” Proc. Chinese Conf. Pattern Recognition (CCPR), 2010.

[42] C.-L. Liu, F. Yin, Q.-F. Wang, and D.-H. Wang, “ICDAR 2011 Chinese handwriting recognition competition,” Proc. Int’l Conf. Document Analysis and Recognition (ICDAR), 2011.

[43] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, “CASIA online and offline Chinese handwriting databases,” Proc. Int’l Conf. Document Analysis and Recognition (ICDAR), 2011.

[44] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, “Theano: new features and speed improvements,” NIPS Deep Learning Workshop, 2012.

[45] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: A CPU and GPU math expression compiler,” Proc. Python for Scientific Computing Conf. (SciPy), 2010.

[46] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, “Online and offline handwritten Chinese character recognition: Benchmarking on new databases,” Pattern Recognition, vol. 46, no. 1, pp. 155–162, 2013.

[47] I.-J. Kim, C. Choi, and S.-H. Lee, “Improving discrimination ability of convolutional neural networks by hybrid learning,” Int’l Journal on Document Analysis and Recognition, pp. 1–9, 2015.

[48] K. Cho, A. Courville, and Y. Bengio, “Describing multimedia content using attention-based encoder-decoder networks,” IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1875–1886, 2015.

[49] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” Proc. Int’l Conf. Machine Learning (ICML), 2015.

[50] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing machines,” arXiv:1410.5401, 2014.

[51] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” Proc. Int’l Conf. Learning Representations (ICLR), 2015.

[52] X.-D. Zhou, D.-H. Wang, F. Tian, C.-L. Liu, and M. Nakagawa, “Handwritten Chinese/Japanese text recognition using semi-Markov conditional random fields,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 10, pp. 2484–2497, 2013.

[53] Q.-F. Wang, F. Yin, and C.-L. Liu, “Handwritten Chinese text recognition by integrating multiple contexts,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1469–1481, 2012.

[54] R. Messina and J. Louradour, “Segmentation-free handwritten Chinese text recognition with LSTM-RNN,” Proc. Int’l Conf. Document Analysis and Recognition (ICDAR), 2015.

[55] Y. Kato and M. Yasuhara, “Recovery of drawing order from single-stroke handwriting images,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 9, pp. 938–949, 2000.

[56] Y. Qiao, M. Nishiara, and M. Yasuhara, “A framework toward restoration of writing order from single-stroked handwriting image,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1724–1737, 2006.

[57] A. Lamb, V. Dumoulin, and A. Courville, “Discriminative regularization for generative models,” arXiv:1602.03220, 2016.

[58] D. Im, C. Kim, H. Jiang, and R. Memisevic, “Generating images with recurrent adversarial networks,” arXiv:1602.05110, 2016.