
Research Article
Multilingual Text Detection with Nonlinear Neural Network

Lin Li (1, 2), Shengsheng Yu (2), Luo Zhong (1), and Xiaozhen Li (2)

(1) School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China
(2) School of Computer Science, Huazhong University of Science and Technology, Wuhan 430070, China

Correspondence should be addressed to Lin Li; lilinwzy@gmail.com

Received 11 July 2015; Accepted 2 September 2015

Academic Editor: Xinguang Zhang

Copyright © 2015 Lin Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Multilingual text detection in natural scenes is still a challenging task in computer vision. In this paper, we apply an unsupervised learning algorithm to learn language-independent stroke features and combine unsupervised stroke feature learning with automatic multilayer feature extraction to improve the representational power of text features. We also develop a novel nonlinear network, based on the traditional Convolutional Neural Network, that is able to detect multilingual text regions in images. The proposed method is evaluated on standard benchmarks and a multilingual dataset and demonstrates improvement over previous work.

1. Introduction

Texts within images contain rich semantic information, which can be very beneficial for visual information retrieval and image understanding. Driven by a variety of real-world applications, scene text detection and recognition have become active research topics in computer vision. To efficiently read text from photographs, the majority of methods follow an intuitive two-step process: text detection followed by text recognition [1]. To a large extent, the performance of text detection affects the accuracy of text recognition. Extracting textual information from natural scenes is therefore a critical prerequisite for text recognition and other image analysis tasks.

Text detection has been considered in many studies, and considerable progress has been achieved in recent years [2–14]. However, most text detection methods have focused on English; few investigations have addressed the problem of multilingual text detection. In our daily lives, multilingual texts coexist everywhere: many environments contain text in two or more scripts in a single image, for example, product tags, street signs, license plates, billboards, and guide information. More and more applications need to achieve text detection regardless of language type.

For different languages, the characters take many different forms and have inconsistent heights, strokes, and writing formats. There are thousands of languages in the world, and the representative and universal features of multilingual text are still unknown. In addition, text embedded in images can vary in font, style, and size, in alignment and orientation, and can appear with unknown colors and under varying lighting conditions. Due to these factors, multilingual text detection in natural scenes is a challenging and important problem.

Our study focuses on learning general stroke feature representations and detecting text in images, even in a multiscript environment. Unlike traditional methods, which mainly rely on a combination of several hand-engineered features, we aim to test the feasibility of a common text detector that uses only automatically learned text features, improving a discriminative clustering algorithm to obtain language-independent stroke features. The learned stroke features, combined with a nonlinear neural network, provide an alternative way to effectively increase the representational power for characters. Because we use deep learned text features, we are able to use simple nonmaximal suppression to locate text.

In the following, we first review the recently published literature and then present the proposed multilingual text detection method in Sections 3 and 4. Section 5 presents the experimental evaluation, and Section 6 concludes the paper.

Hindawi Publishing Corporation
Mathematical Problems in Engineering
Volume 2015, Article ID 431608, 7 pages
http://dx.doi.org/10.1155/2015/431608


2. Related Work

Existing methods proposed for text detection in natural scenes can be broadly categorized into two groups: connected component methods and sliding window methods.

Connected component (CC) methods separate text and nontext information at the pixel level and group text pixels to construct character candidates by connected component analysis. Epshtein et al. [2] leveraged the idea of recovering stroke width and proposed using the CCs in a stroke-width-transformed image. Yao et al. [3] extracted regions in the Stroke Width Transform (SWT) domain. Neumann and Matas [4] posed the character detection problem as an efficient sequential selection from the set of Extremal Regions (ERs). Chen et al. [5] determined the stroke width using the distance transform of edge-enhanced Maximally Stable Extremal Regions (MSERs). Using MSERs as CCs representing characters has become the focus of several recent works [6–9].

Sliding window-based methods, also known as region-based methods, scan a sliding subwindow through the image to search for possible text and then use machine learning techniques to locate text. Wang et al. [10], extending their previous work [11], built an end-to-end scene text recognition system based on a sliding window character classifier using Random Ferns. Wang et al. [12] used multilayer neural networks for text detection. Jaderberg et al. [13] achieved state-of-the-art performance by implementing sliding window detection as a byproduct of the Convolutional Neural Network (CNN).

In the task of multilingual text detection, previous studies are mostly sliding window-based. In [14, 16], the authors proposed similar methods using hand-engineered features to describe the text, with a subwindow scanned over different scales and positions of an image pyramid in order to classify text in images. To achieve text detection that is invariant to language type, the feature representation is therefore very important. However, little research has attempted to apply deep learning to learn multilingual text features. The CNN is a special kind of neural network whose deep nonlinear structure shows a strong ability to learn discriminative features from observed samples. We therefore investigate the problem of multilingual text detection within the framework of the CNN.

3. Stroke Feature Learning

According to linguistic studies, the basic feature of text is the stroke, as in the Latin alphabet and the basic strokes of Chinese. Different languages also share the same characteristics in appearance. Inspired by these observations, we consider it possible to design language-independent stroke features.

In order to cope with multilingual scenes, we seek to learn a bank of universal low-level stroke features directly from raw images. The learned stroke features should capture the essential substructures of strokes while also being the most representative and discriminative ones. Many unsupervised learning algorithms can be used to learn hidden data prototypes from a dataset, such as $K$-means clustering and sparse coding. The goal of sparse coding is to construct a dictionary $D$ that minimizes the reconstruction error

$$\min_{D, s} \sum_{i} \left\| D s^{(i)} - x^{(i)} \right\|_2^2 + \lambda \left\| s^{(i)} \right\|_1,$$

so that a data vector $x^{(i)} \in \mathbb{R}^n$ ($i = 1, \ldots, m$) can be mapped to a code vector $s^{(i)}$. For every $s^{(i)}$, the sparse coding algorithm must repeatedly solve a convex optimization problem, which is very expensive when applied to large-scale image data. By comparison, the optimal $s^{(i)}$ in the classic $K$-means algorithm is simply

$$s_j^{(i)} = \begin{cases} D^{(j)\top} x^{(i)} & \text{if } j = \arg\max_{l} \left| D^{(l)\top} x^{(i)} \right|, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

Figure 1: Training samples for stroke feature learning.

In addition, $K$-means has been identified by computer vision researchers as a fast and effective method for learning features from images. We therefore improve the $K$-means variant proposed by Coates et al. [18] and use it to learn stroke feature representations, since it learns representative stroke features from large collections while being much faster than sparse coding.
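The code assignment in (1) can be sketched in a few lines of NumPy. This is only an illustration under assumptions of ours: the function name `kmeans_code` is hypothetical, and the dictionary is assumed to be stored with one atom per column.

```python
import numpy as np

def kmeans_code(D, x):
    """Code vector of Eq. (1): keep only the largest-magnitude projection.

    D: (n, k) dictionary, one atom per column (assumed layout).
    x: (n,) data vector.
    """
    D = np.asarray(D, dtype=float)
    x = np.asarray(x, dtype=float)
    proj = D.T @ x                       # inner products D^(j)T x for every atom j
    k = int(np.argmax(np.abs(proj)))     # index of the best-matching atom
    s = np.zeros_like(proj)
    s[k] = proj[k]                       # every other entry stays zero
    return s
```

Unlike sparse coding, no optimization problem is solved per sample: one matrix-vector product and one `argmax` are enough, which is the speed advantage the text refers to.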

In particular, we first collect a set of training images, which are 32 × 32 grayscale images extracted from the ICDAR 2003, ICDAR 2011, and ICDAR 2013 datasets, the multilingual dataset, and Google. Each image contains a character in its middle. Characters in the training images include 26 uppercase letters, 26 lowercase letters, 10 digits, 20 basic Chinese strokes, and 28 Arabic letters. Some training images used for stroke feature learning are illustrated in Figure 1. We randomly extract $m$ patches of 8 × 8 pixels from the images. Before training, we apply contrast normalization to each cropped patch. To avoid generating many highly correlated stroke features, ZCA whitening is then applied to the patches to yield vectors $x^{(i)} \in \mathbb{R}^{64}$, $i \in \{1, \ldots, m\}$.
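The two preprocessing steps can be sketched as follows. This is a minimal NumPy illustration, not the authors' MATLAB code: the function names are ours, the constants 10 and 0.1 are taken as they appear in Algorithm 2, and centering the data before whitening is an added assumption.

```python
import numpy as np

def normalize_patches(X, eps=10.0):
    """Per-patch contrast normalization. X: (m, 64), one flattened patch per row."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    return (X - mean) / np.sqrt(eps + var)

def zca_whiten(X, eps=0.1):
    """ZCA whitening: x <- V (D + eps*I)^(-1/2) V^T x, with V D V^T = cov(x)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0, keepdims=True)      # centering (our assumption)
    cov = np.cov(Xc, rowvar=False)
    evals, V = np.linalg.eigh(cov)              # cov = V diag(evals) V^T
    W = V @ np.diag(1.0 / np.sqrt(evals + eps)) @ V.T
    return Xc @ W                               # W is symmetric, so rows map correctly
```

A typical call would be `X = zca_whiten(normalize_patches(raw_patches))`, where `raw_patches` is a hypothetical (m, 64) array of flattened 8 × 8 patches. The regularizer `eps` keeps near-zero eigenvalues from amplifying noise.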

Because the $K$-means algorithm is highly dependent on initialization, different initial guesses of the centroids seriously affect the clustering result. To obtain a desirable clustering result, we propose a novel initialization method for choosing suitable initial stroke features. We introduce a dispersion metric based on the local information of the data, which guarantees that the initial centroids are selected from locally dense regions of the data space and that the centroids are separated from each other by a certain distance. Our initialization framework includes three steps: (1) estimating the local dispersion metric for each data point; (2) selecting the data whose metric is higher than a threshold as candidates for initial features; and (3) determining the initial stroke features from the candidates. The proposed initialization method is implemented as follows. We first construct an adjacency graph and its Gram matrix, which is computed as

$$w_{ij} = \begin{cases} 1 & \text{if } \left\| x^{(i)} - x^{(j)} \right\| < \varepsilon, \\ 0 & \text{otherwise,} \end{cases} \qquad \varepsilon = \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} \operatorname{dis}\left(x^{(i)}, x^{(j)}\right)}{m \times (m - 1)} \times 0.5, \tag{2}$$

where $\operatorname{dis}(x^{(i)}, x^{(j)})$ is the distance between the patches $x^{(i)}$ and $x^{(j)}$.

Secondly, we introduce the dispersion metric $d$ with $m$ components, whose entries are given by $d^{(i)} = \sum_{j=1}^{m} w_{ij}$ ($i = 1, \ldots, m$). We set a threshold $q = \operatorname{median}(d)$; if the value $d^{(i)}$ associated with the data point $x^{(i)}$ is larger than the threshold, $x^{(i)}$ is marked as a candidate initial feature. Then we use an algorithm similar to [19] to select the initial stroke features from the candidates. Because we use the stroke features as the first layer convolution kernels of our proposed CNN, $n_1$ is the number of first layer convolutional filters. The detailed steps of the proposed initialization method are presented in Algorithm 1.

Algorithm 1: Stroke feature initialization method.
Input: the patches $x$ ($x^{(i)} \in \mathbb{R}^{64}$, $i = 1, \ldots, m$)
Output: initial features $F_0 \in \mathbb{R}^{64 \times n_1}$
(1) $C \leftarrow \emptyset$
(2) construct $w_{ij}$ based on (2)
(3) compute the dispersion metric $d^{(i)} = \sum_{j=1}^{m} w_{ij}$ and the threshold $q = \operatorname{median}(d)$
(4) for all data in $x$
(5)     if data $x^{(j)}$ has dispersion metric $d^{(j)} > q$
(6)         $C \leftarrow C \cup \{x^{(j)}\}$
(7)     end if
(8) end for
(9) $F_0^{(1)} = x^{(i)}$, where $x^{(i)}$ is randomly selected from $C$
(10) $F_0^{(2)} = \arg\max_{x^{(j)}} \left\| x^{(j)} - F_0^{(1)} \right\|$, $\forall x^{(j)} \in C$
(11) set $k = 2$
(12) repeat
(13)     $k = k + 1$
(14)     $F_0^{(k)} = \arg\max_{x^{(j)}} \min_{t} \left\| x^{(j)} - F_0^{(t)} \right\|$, $\forall x^{(j)} \in C$, $t = 1, \ldots, k - 1$
(15) until $k = n_1$
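Algorithm 1 can be sketched in NumPy as below. This is an illustrative reading, with several assumptions of ours: `dis` is taken to be the Euclidean distance, the function name and the `seed` parameter are hypothetical, and the full pairwise distance matrix is built in memory, which is only practical for a modest sample of patches.

```python
import numpy as np

def init_stroke_features(X, n1, seed=0):
    """Sketch of Algorithm 1. X: (m, d) patches as rows. Returns F0 of shape (d, n1)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    # Pairwise Euclidean distances (O(m^2) memory).
    sq = (X ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0))
    # Eq. (2): eps is half the mean pairwise distance; w_ij = 1 for close pairs.
    eps = 0.5 * dist.sum() / (m * (m - 1))
    w = dist < eps
    d_metric = w.sum(axis=1)                      # dispersion metric d^(i)
    cand = X[d_metric > np.median(d_metric)]      # candidates from dense regions
    if len(cand) < n1:
        raise ValueError("fewer candidates than requested features")
    # Steps 9-15: random first feature, then farthest-first selection.
    chosen = [int(rng.integers(len(cand)))]
    min_d = np.linalg.norm(cand - cand[chosen[0]], axis=1)
    while len(chosen) < n1:
        nxt = int(np.argmax(min_d))               # farthest from all chosen so far
        chosen.append(nxt)
        min_d = np.minimum(min_d, np.linalg.norm(cand - cand[nxt], axis=1))
    return cand[chosen].T
```

Keeping a running minimum distance `min_d` avoids recomputing the inner `min` of step (14) over all previously chosen features at every iteration.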

After initializing $F_0$, we learn stroke features $F \in \mathbb{R}^{64 \times n_1}$ according to

$$\min_{F, s^{(i)}} \sum_{i} \left\| F s^{(i)} - x^{(i)} \right\|^2. \tag{3}$$

For all $j \in \{1, \ldots, n_1\}$, compute the inner products $F^{(j)\top} x^{(i)}$, and let $k$ be the value of $j$ that maximizes their magnitude. If $j = k$, then $s_j^{(i)} = F^{(j)\top} x^{(i)}$; otherwise $s_j^{(i)} = 0$. Then, with $s^{(i)}$ fixed, minimize (3) to obtain $F$. The optimization is done by alternating minimization over $F$ and $s^{(i)}$. The full stroke feature learning algorithm with $K$-means is summarized in Algorithm 2.
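The alternating minimization of (3) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the function name and the fixed iteration count are ours, and features that win no patch in an iteration are simply left unchanged, which is an assumption. With the codes fixed and each feature constrained to unit norm, the best feature direction is the code-weighted sum of its assigned patches, which is what `X.T @ S` computes before renormalization.

```python
import numpy as np

def learn_stroke_features(X, F0, n_iter=10):
    """Alternate between codes s^(i) and features F. X: (m, d) rows; F0: (d, n1)."""
    X = np.asarray(X, dtype=float)
    F = np.asarray(F0, dtype=float)
    F = F / (np.linalg.norm(F, axis=0, keepdims=True) + 1e-12)
    rows = np.arange(X.shape[0])
    for _ in range(n_iter):
        proj = X @ F                              # proj[i, j] = F^(j)T x^(i)
        k = np.argmax(np.abs(proj), axis=1)       # winning feature per patch
        S = np.zeros_like(proj)
        S[rows, k] = proj[rows, k]                # one nonzero entry per code
        F_new = X.T @ S                           # code-weighted sum per feature
        norms = np.linalg.norm(F_new, axis=0, keepdims=True)
        dead = norms[0] < 1e-12                   # features that won no patch
        F_new[:, dead] = F[:, dead]               # keep them unchanged (assumption)
        norms[0, dead] = 1.0
        F = F_new / norms                         # enforce unit-norm features
    return F
```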

For a general clustering algorithm, the number of clusters is known in advance or set by prior knowledge. In our method, the learned stroke features are combined with a Convolutional Neural Network classifier for text detection. We therefore further study how to choose the number of features that achieves the highest text/no text classification accuracy. To analyze the impact of the number of learned stroke features, we learned four stroke feature sets of different sizes. With the first layer convolution kernels of our CNNs set to $n_1 = 96$, 128, 256, and 320, we train a detector with each stroke feature set and evaluate the detection model on a subset of the ICDAR 2003 test images. As shown in Figure 2, the $F$-measure increases as $n_1$ gets larger. Once $n_1$ equals 256, the recall reaches its maximum value, and about 80% of detected text matches the ground truth. When $n_1$ is greater than 256, the $F$-measure no longer increases and is even slightly reduced. Based on this analysis, we select $n_1 = 256$ in our method.

4. Multilingual Text Detection

The idea of our text detection method is to design a "feature learning" pipeline that leads to representative text features and to use these features for detecting multilingual text. The two main components of this pipeline are as follows: (1) use the unsupervised clustering algorithm to generate a set of stroke features $F$; (2) build a hierarchical network and combine it with the stroke features $F$ to learn high-level text features. The first component has been described in detail in Section 3. How to build and train the multilayer neural network is presented in this section.

Algorithm 2: Stroke feature learning algorithm.
Input: $m$ 8 × 8 input patches $x^{(i)} \in \mathbb{R}^{64}$
Output: learned stroke features $F \in \mathbb{R}^{64 \times n_1}$
Procedure:
(1) Normalize input: $x^{(i)} = \dfrac{x^{(i)} - \operatorname{mean}(x^{(i)})}{\sqrt{10 + \operatorname{var}(x^{(i)})}}$
(2) ZCA whiten input: $V D V^{\top} = \operatorname{cov}(x)$, $x^{(i)} = V (D + 0.1 \times I)^{-1/2} V^{\top} x^{(i)}$
(3) Initialize $F$ following the steps in Algorithm 1
(4) Repeat
        Set $s_k^{(i)} = F^{(k)\top} x^{(i)}$ for $k = \arg\max_{j} \left| F^{(j)\top} x^{(i)} \right|$
        Set $s_j^{(i)} = 0$ for all other $j \neq k$
        Fix $s^{(i)}$ and solve $\min_{F} \sum_{i} \left\| F s^{(i)} - x^{(i)} \right\|^2$ s.t. $\left\| s^{(i)} \right\|_1 = \left\| s^{(i)} \right\|_\infty$ and $\left\| F^{(j)} \right\|_2 = 1$
    Until convergence or the iteration limit is reached

Figure 2: The accuracy analysis on different stroke feature number. [Plot of recall, precision, and $F$-measure (text/no text classification accuracy, vertical axis from about 0.62 to 0.82) against the number of stroke features, with points at $n$ = 96, 128, 256, and 320.]

By making several technical changes to traditional Convolutional Neural Network architectures [20], we develop a novel classifier for multilingual text detection. We make two major improvements: (1) unlike the traditional approach, in which the convolution kernels of the CNN are randomly generated, we select the unsupervised learned stroke features $F$ as the first layer convolution kernels of our network; (2) the intermediate features obtained from the learning process, which serve as the second layer's convolution kernels, can be used to extract text features more efficiently.

Our network has two convolutional layers with $n_1$ and $n_2$ filters, respectively. We fix the filters in the first convolutional layer, which are the stroke features learned in Section 3, so the low-level filters are $F \in \mathbb{R}^{64 \times n_1}$ with $n_1 = 256$. We build a labeled training dataset in which all training images are fixed-size 32 × 32 images (8877 positive, 9000 negative). Starting from the first layer, the input is a grayscale cropped training image, that is, $z_0 = x$. The input is convolved with 256 filters of size 8 × 8, resulting in a map of size 25 × 25 (to avoid boundary effects) with 256 channels. The first convolutional layer output $z_1$ is a new feature map computed by the nonlinear response function $z_1 = \max\{0, |F^{\top} x| - \alpha\}$, where $\alpha = 0.5$. Convolutional layers can be interleaved with pooling layers, which simplify system parameters by statistical aggregation of features. We average pool over the first convolutional layer response map to reduce its size to 5 × 5. The sequence continues with another convolutional layer and pooling layer, resulting in feature maps with 256 channels and a size of 2 × 2; this size is the same as the dimension of the second layer convolutional filters. The second layer outputs are fully connected to the classification layer. An SVM classifier is used as a binary classifier that estimates whether a 32 × 32 image contains text. We train the network using stochastic gradient descent and back-propagation. The classification error function includes a loss term and a regularization term: the loss term is a squared hinge loss, and the norm used in the penalization is L2. We also use dropout in the second convolutional layer to help prevent overfitting. The structure of the proposed neural network is presented in Figure 3.
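The first layer's shape arithmetic can be checked with a short NumPy sketch: an 8 × 8 filter slid over a 32 × 32 image without padding gives 32 − 8 + 1 = 25 positions per side, and averaging over non-overlapping 5 × 5 blocks gives 25 / 5 = 5. The function name is ours, the non-overlapping pooling layout is an assumption (the text only states the output size), and per-patch normalization and whitening are omitted for brevity.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def first_layer(image, F, alpha=0.5):
    """image: (32, 32) grayscale array; F: (64, n1) stroke features.
    Returns the average-pooled first-layer response of shape (5, 5, n1)."""
    image = np.asarray(image, dtype=float)
    assert image.shape == (32, 32)
    patches = sliding_window_view(image, (8, 8))        # (25, 25, 8, 8)
    patches = patches.reshape(25, 25, 64)               # flatten each 8x8 window
    z1 = np.maximum(0.0, np.abs(patches @ F) - alpha)   # (25, 25, n1)
    n1 = z1.shape[-1]
    # Non-overlapping 5x5 average pooling (assumed layout): 25 / 5 = 5 per side.
    return z1.reshape(5, 5, 5, 5, n1).mean(axis=(1, 3))
```

The threshold `alpha` zeroes out weak responses, so only patches that correlate strongly with a stroke feature, in either polarity, contribute to the pooled map.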

After our network has been trained, the detection process starts from a large raw-pixel input image and leverages the convolutional structure of the CNN to process the entire image. We slide a 32 × 32 pixel window across the input image and pass each window to the learned classifier, using the intermediate hidden layers as features to classify text/no text and generate text bounding boxes. We use 12 different scales in our detection method. At a given scale $s$, the input image is rescaled and the sliding window scans through the scaled image. At each point $(x, y)$, a window that contains a single centered character produces a positive detector response $R_s[x, y]$. In each row $r$ of the scaled image, we check whether there is any $R_s[x, y] > 0$. If a positive detector response exists, we form a line-level bounding box $L_s^r$ with the same height as the sliding window at scale $s$, where $\min(x)$ and $\max(x)$ define the left and right boundaries of $L_s^r$. At each scale, the input image is resized and a set of candidate text bounding boxes is generated independently. This procedure is repeated 12 times and yields groups of possibly overlapping bounding boxes. We then apply nonmaximal suppression (NMS): each box is scored, and every box that overlaps a higher-scoring box by more than 50% is removed, which gives the final text bounding boxes $L$.

Figure 3: The proposed network for multilingual text detection. [Diagram with feature map sizes 32 × 32 × 1 ($z_0$, input image), 25 × 25 × 256 ($z_1$), 5 × 5 × 256, 4 × 4 × 256, and 2 × 2 × 256; stages labeled first layer ($F$: first layer convolution kernels, 256 filters), average pooling, second layer, and SVM classifier.]
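The row-wise box formation and the suppression step can be sketched as below. Several details are assumptions of ours because the text does not fix them: a response at $(x, y)$ is treated as the top-left corner of its window, the score of a line box is the largest response in its row, and overlap is measured as intersection over union. The function names are hypothetical.

```python
import numpy as np

def line_boxes(R, win=32):
    """R: 2-D response map at one scale, indexed R[y, x]; positive means a character hit.
    Returns (box, score) pairs with box = (x1, y1, x2, y2), one per row with a hit."""
    out = []
    for y in range(R.shape[0]):
        xs = np.flatnonzero(R[y] > 0)
        if xs.size:
            box = (int(xs.min()), y, int(xs.max()) + win, y + win)
            out.append((box, float(R[y, xs].max())))   # score choice is an assumption
    return out

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep a box unless it overlaps a kept, higher-scoring box by > thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

Boxes produced at different scales would first be mapped back to the coordinates of the original image before `nms` is applied across all of them.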

5. Experiments

5.1. Dataset. To evaluate the effectiveness and robustness of the proposed text detection algorithm, we have conducted experiments on standard benchmarks, including the challenging datasets ICDAR 2003 [21], MSRA-TD500 [3], and KAIST [17].

The ICDAR 2003 Robust Reading and Text Locating database is a widely used public dataset for scene text detection algorithms. The database contains 258 training images and 251 testing images. It provides the path to each image and text bounding box annotations for each image.

The MSRA-TD500 dataset contains images with text in English and Chinese. The dataset contains 500 images in total, with resolutions varying from 1296 × 864 to 1920 × 1280. These images are taken from indoor (office and mall) and outdoor (street) scenes using a pocket camera.

KAIST provides a scene text dataset consisting of 3000 images of indoor and outdoor scenes. Word and character bounding boxes are provided, as well as segmentation maps of characters. Texts in KAIST images are in English, Korean, and a mixture of English and Korean.

We also created a new multilingual dataset that is composed of three representative languages: English, Chinese, and Arabic. These three languages stand for three types of writing systems: English for alphabets, Chinese for ideographs, and Arabic for abjads. Each group, corresponding to one language, contains 80 images.

To learn the stroke features, the training samples include 5980 English text samples, 800 Chinese text samples, and 1100 Arabic text samples. In addition, 3000 nontext samples are extracted from 200 negative images using a bootstrap method. All these samples are normalized to 32 × 32, which is consistent with the detection window.

Table 1: Performance of the proposed method.

Language   Total images   Precision   Recall   F-measure
English    800            0.76        0.88     0.8
Korean     500            0.73        0.84     0.78
Chinese    200            0.68        0.96     0.79
Arabic     200            0.70        0.72     0.66

Figure 4: Text detection samples on different language images.

5.2. Results. The proposed algorithm is implemented on an Intel Core i5 processor at 2.9 GHz with 8 GB RAM, using MATLAB R2014b.

To validate the performance of our proposed algorithm, we use the definitions from the ICDAR 2003 competition [21] for text detection precision, recall, and $F$-measure. Precision and recall are

$$P = \frac{\sum_{r_e \in E} m(r_e, T)}{|E|}, \qquad R = \frac{\sum_{r_t \in T} m(r_t, E)}{|T|},$$

where $m(r, R)$ is the best match for a rectangle $r$ in a set of rectangles $R$, and $E$ and $T$ are our estimated rectangles and the ground truth rectangles, respectively. We adopt the $F$-measure to combine the precision and recall figures into a single measure of quality: $F = 1/(\alpha/P + \alpha/R)$, where $\alpha = 0.5$.
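These measures can be sketched in plain Python as follows. The pairwise match used here, intersection area divided by the area of the smallest rectangle containing both boxes, is our reading of the ICDAR 2003 protocol [21] and is not spelled out in this paper, so treat it as an assumption; the function names are likewise ours. Boxes are (x1, y1, x2, y2) tuples.

```python
def match(a, b):
    """Assumed pairwise match: intersection area over area of the joint bounding box."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    bw = max(a[2], b[2]) - min(a[0], b[0])
    bh = max(a[3], b[3]) - min(a[1], b[1])
    return (iw * ih) / (bw * bh) if bw * bh > 0 else 0.0

def best_match(r, rects):
    """m(r, R): best match of rectangle r within the set rects (0 if the set is empty)."""
    return max((match(r, q) for q in rects), default=0.0)

def f_measure(P, R, alpha=0.5):
    """F = 1 / (alpha/P + alpha/R); with alpha = 0.5 this is the harmonic mean."""
    return 1.0 / (alpha / P + alpha / R) if P > 0 and R > 0 else 0.0

def evaluate(E, T, alpha=0.5):
    """E: estimated rectangles, T: ground truth rectangles. Returns (P, R, F)."""
    P = sum(best_match(r, T) for r in E) / len(E) if E else 0.0
    R = sum(best_match(r, E) for r in T) / len(T) if T else 0.0
    return P, R, f_measure(P, R, alpha)
```

As a quick check against Table 1, `f_measure(0.76, 0.88)` evaluates to roughly 0.816, in line with the English row.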

Experiments are carried out on a set of images containing text in four different languages, namely, English, Chinese, Arabic, and Korean. English text images are selected from ICDAR 2003, ICDAR 2011, and ICDAR 2013; Korean images from KAIST; some Chinese images from MSRA-TD500 and the others from the multilingual dataset; and Arabic images from the multilingual dataset and Google. The results of these evaluations are summarized in Table 1. As can be seen from Table 1, the $F$-measures on different languages are close to each other except for Arabic, because the continuous writing style of Arabic lowers the recall for this script. The experimental results indicate that our method is not tuned to any particular language and performs approximately equally well on all the scripts.

Figure 4 shows some text successfully detected by our system on images containing text in different languages. Although the training samples contain text only in English, Chinese, and Arabic, our method can detect text not only in these three representative languages but also in a number of other languages, such as French, German, Korean, and Japanese. This shows that our method has some robustness.


Table 2: Performance comparison on different benchmarks.

Method       Dataset      Precision   Recall   F-measure
Pan [15]     ICDAR 2003   0.645       0.659    0.652
Zhou [16]    ICDAR 2003   0.37        0.79     0.53
Our method   ICDAR 2003   0.45        0.80     0.57
Lee [17]     KAIST        0.69        0.60     0.64
Our method   KAIST        0.59        0.79     0.67

Figure 5: Text detection samples on images containing text in two different languages.

We also selected the methods proposed by Zhou et al. [16], Pan et al. [15], and Lee et al. [17] for comparison. These algorithms achieve good results on the standard benchmarks and use different approaches to detect text. The performance comparison is given in Table 2. Our method achieves high recall on the different benchmarks. This also reflects the strong representativeness of our learned features, which can successfully detect the information associated with the text in images.

However, our test results on the MSRA-TD500 dataset are not good, with a $P$/$R$/$F$-measure of 0.30/0.32/0.31. Shi et al. [6] achieved state-of-the-art text detection performance of 0.52/0.53/0.5 on the same dataset. The main reason is that the MSRA-TD500 dataset was created for the study of multiorientation text detection and contains many images with nonhorizontal text lines, whereas our method produces text bounding boxes along the horizontal direction only.

Figure 5 shows some other test samples. The results show that our method is effective when a single image contains text in two or more different languages as well as numbers. The bottom row in Figure 5 shows some failure samples. Some of these failures are missed detections of parts of Arabic text, because Arabic words are mostly linked by a continuous line. In this case, the stroke feature alone is not sufficient to detect text; stroke width information is essential for languages such as Arabic. Other failures are caused by interfering objects whose appearance is similar to text.

6. Conclusion

The aim of this study is to propose a multilingual text detection method. Traditional methods in this area mainly rely on large amounts of hand-engineered features or prior knowledge. Our work is distinct in two ways: (1) we use primitive stroke features learned by an unsupervised learning algorithm as the network's convolutional kernels; (2) we leverage the trained multilayer neural network to learn high-level abstract text features for the detector. Experiments on the public benchmarks and the multilingual dataset show that our method is able to localize text regions of different scripts in natural scene images. The experimental results demonstrate the robustness of the proposed method.

From the failed samples in the experiments, we analyze the limitations of our technique for further improvement. On the one hand, for languages with a continuous writing style, such as Arabic, automatically learned features are not enough for detection; connected component analysis will be added to our method to improve the precision of the final results. On the other hand, the multiorientation text problem will be considered.

Conflict of Interests

The authors declare that they have no conflict of interests regarding this work.

References

[1] X. Chen and A. L. Yuille, "Detecting and reading text in natural scenes," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-366–II-373, July 2004.

[2] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 2963–2970, June 2010.

[3] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '12), pp. 1083–1090, June 2012.

[4] L. Neumann and J. Matas, "Real-time scene text localization and recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '12), pp. 3538–3545, Providence, RI, USA, June 2012.

[5] H. Chen, S. S. Tsai, G. Schroth, D. M. Chen, R. Grzeszczuk, and B. Girod, "Robust text detection in natural images with edge-enhanced maximally stable extremal regions," in Proceedings of the 18th IEEE International Conference on Image Processing (ICIP '11), pp. 2609–2612, September 2011.

[6] C. Shi, C. Wang, B. Xiao, Y. Zhang, and S. Gao, "Scene text detection using graph model built upon maximally stable extremal regions," Pattern Recognition Letters, vol. 34, no. 2, pp. 107–116, 2013.

[7] A. Shahab, F. Shafait, and A. Dengel, "ICDAR 2011 robust reading competition challenge 2: reading text in scene images," in Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR '11), pp. 1491–1496, Beijing, China, September 2011.

[8] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, "Robust text detection in natural scene images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 970–983, 2014.


[9] L. Neumann and J. Matas, "Scene text localization and recognition with oriented stroke detection," in Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV '13), pp. 97–104, December 2013.

[10] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in Proceedings of the IEEE International Conference on Computer Vision (ICCV '11), pp. 1457–1464, IEEE, Barcelona, Spain, November 2011.

[11] K. Wang and S. Belongie, "Word spotting in the wild," in Proceedings of the 11th European Conference on Computer Vision (ECCV '10), pp. 591–604, 2010.

[12] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, "End-to-end text recognition with convolutional neural networks," in Proceedings of the 21st International Conference on Pattern Recognition (ICPR '12), pp. 3304–3308, November 2012.

[13] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in Proceedings of the 22nd International Conference on Pattern Recognition (ICPR '14), 2014.

[14] A. Raza, I. Siddiqi, C. Djeddi, and A. Ennaji, "Multilingual artificial text detection using a cascade of transforms," in Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR '13), pp. 309–313, Washington, DC, USA, August 2013.

[15] Y.-F. Pan, X. Hou, and C.-L. Liu, "A hybrid approach to detect and localize texts in natural scene images," IEEE Transactions on Image Processing, vol. 20, no. 3, pp. 800–813, 2011.

[16] G. Zhou, Y. Liu, Q. Meng, and Y. Zhang, "Detecting multilingual text in natural scene," in Proceedings of the 1st International Symposium on Access Spaces (ISAS '11), pp. 116–120, IEEE, Yokohama, Japan, June 2011.

[17] S. Lee, M. S. Cho, K. Jung, and J. H. Kim, "Scene text extraction with edge constraint and text collinearity," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 3983–3986, August 2010.

[18] A. Coates, B. Carpenter, C. Case et al., "Text detection and character recognition in scene images with unsupervised feature learning," in Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR '11), pp. 440–445, September 2011.

[19] M. E. Celebi and H. A. Kingravi, "Deterministic initialization of the k-means algorithm using hierarchical clustering," International Journal of Pattern Recognition and Artificial Intelligence, vol. 26, no. 7, Article ID 1250018, 2012.

[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[21] S. Lucas, P. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in Proceedings of the 7th International Conference on Document Analysis and Recognition (ICDAR '03), pp. 682–687, Edinburgh, UK, August 2003.


2 Related Work

Existing methods for text detection in natural scenes can be broadly categorized into two groups: connected component methods and sliding window methods.

Connected component (CC) methods separate text from non-text information at the pixel level and group text pixels into character candidates by connected component analysis. Epshtein et al. [2] leveraged the idea of recovering stroke width and proposed using the CCs in a stroke-width-transformed image. Yao et al. [3] extracted regions in the Stroke Width Transform (SWT) domain. Neumann and Matas [4] posed character detection as an efficient sequential selection from the set of Extremal Regions (ERs). Chen et al. [5] determined the stroke width using the distance transform of edge-enhanced Maximally Stable Extremal Regions (MSERs). Using MSERs as CCs representing characters has become the focus of several recent works [6–9].

Sliding window-based methods, also known as region-based methods, scan a sliding subwindow through the image to search for possible text and then use machine learning techniques to locate it. Wang et al. [10], extending their previous work [11], built an end-to-end scene text recognition system based on a sliding window character classifier using Random Ferns. Wang et al. [12] used multilayer neural networks for text detection. Jaderberg et al. [13] achieved state-of-the-art performance by implementing sliding window detection as a byproduct of a Convolutional Neural Network (CNN).

In multilingual text detection, previous studies are mostly sliding window-based. The authors of [14, 16] proposed similar methods that use hand-engineered features to describe text: a subwindow is scanned over different scales and positions of the image pyramid in order to classify text in images. To achieve text detection that is invariant to language type, the feature representation is therefore very important. However, little research has attempted to apply deep learning to learn multilingual text features. A CNN is a special kind of neural network whose deep nonlinear structure shows a strong ability to learn discriminative features from observed samples. We therefore investigate multilingual text detection within the CNN framework.
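The pyramid scan described for the sliding window methods can be sketched as follows. This is a minimal illustration only: the function names, the 32-pixel window, the stride, the scale factor, and the `classify` callback are all assumptions introduced here, not details taken from [14, 16].

```python
import numpy as np

def pyramid(image, scale=0.75, min_size=32):
    """Yield (factor, level) pairs, shrinking by nearest-neighbour sampling."""
    h, w = image.shape
    factor = 1.0
    while True:
        nh, nw = int(h * factor), int(w * factor)
        if min(nh, nw) < min_size:
            break
        rows = (np.arange(nh) / factor).astype(int).clip(0, h - 1)
        cols = (np.arange(nw) / factor).astype(int).clip(0, w - 1)
        yield factor, image[np.ix_(rows, cols)]
        factor *= scale

def sliding_windows(image, size=32, stride=8):
    """Yield (top, left, window) for every window position on one level."""
    h, w = image.shape
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            yield top, left, image[top:top + size, left:left + size]

def detect(image, classify, size=32, stride=8, threshold=0.5):
    """Return boxes (x0, y0, x1, y1), in original coordinates, that pass."""
    boxes = []
    for factor, level in pyramid(image, min_size=size):
        for top, left, win in sliding_windows(level, size, stride):
            # classify is a hypothetical text/non-text scorer in [0, 1].
            if classify(win) >= threshold:
                boxes.append((left / factor, top / factor,
                              (left + size) / factor, (top + size) / factor))
    return boxes
```

Dividing window coordinates by the level's scale factor maps each hit back to the original image, so hits from coarse levels correspond to large text.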

3 Stroke Feature Learning

According to linguistic studies, the basic feature of text is the stroke, as in the Latin alphabet and the Chinese basic strokes, and different languages share the same characteristics in appearance. This suggests that language-independent stroke features can be designed.

In order to cope with multilingual scenes, we seek to learn a bank of universal low-level stroke features directly from raw images. The learned stroke features should capture the essential substructures of strokes while being the most representative and discriminative ones. Many unsupervised learning algorithms can be used to learn hidden data prototypes from a dataset, such as $K$-means clustering and sparse coding. The goal of sparse coding is to construct a dictionary $D$ that minimizes the reconstruction error

$$\min_{D, s} \sum_i \left\| D s^{(i)} - x^{(i)} \right\|_2^2 + \lambda \left\| s^{(i)} \right\|_1,$$

so that a data vector $x^{(i)} \in \mathbb{R}^n$ ($i = 1, \ldots, m$) can be mapped to a code vector $s^{(i)}$. For every $s^{(i)}$, the sparse coding algorithm has to repeatedly solve a convex optimization problem, which is very expensive on large-scale image data. By comparison, the optimal $s^{(i)}$ in the classic $K$-means algorithm is simply

$$s_j^{(i)} = \begin{cases} D^{(j)\top} x^{(i)} & \text{if } j = \arg\max_{l} \left| D^{(l)\top} x^{(i)} \right|, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

Figure 1: Training samples for stroke feature learning.

In addition, $K$-means has been identified by computer vision researchers as a fast and effective method for learning features from images. We therefore improve the $K$-means variant proposed by Coates et al. [18] and use it to learn stroke feature representations, since it learns representative stroke features from large collections while being much faster than sparse coding.

In particular, we first collect a set of training images, which are 32 × 32 grayscale images extracted from the ICDAR 2003, ICDAR 2011, and ICDAR 2013 datasets, a multilingual dataset, and Google. Each image contains a character in its middle. The characters include 26 uppercase letters, 26 lowercase letters, 10 digits, 20 Chinese basic strokes, and 28 Arabic letters. Some training images used for stroke feature learning are illustrated in Figure 1. We randomly extract $m$ 8 × 8 pixel patches from the images. Before training, we apply contrast normalization to each patch. To avoid generating many highly correlated stroke features, ZCA whitening is then applied to the patches to yield vectors $x^{(i)} \in \mathbb{R}^{64}$, $i \in \{1, \ldots, m\}$.
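The preprocessing chain of this paragraph (random 8 × 8 patches, per-patch contrast normalization, ZCA whitening) might look like the NumPy sketch below. The function names, the random seed, and the default constants `offset=10.0` and `eps=0.1` are assumptions for illustration; the whitening uses the standard eigendecomposition form $V (D + \epsilon I)^{-1/2} V^{\top}$ of the patch covariance.

```python
import numpy as np

def extract_patches(images, n_patches, size=8, seed=0):
    """Sample random size x size patches from a stack of grayscale images."""
    rng = np.random.default_rng(seed)
    n, h, w = images.shape
    idx = rng.integers(0, n, n_patches)
    rows = rng.integers(0, h - size + 1, n_patches)
    cols = rng.integers(0, w - size + 1, n_patches)
    patches = np.stack([images[i, r:r + size, c:c + size]
                        for i, r, c in zip(idx, rows, cols)])
    return patches.reshape(n_patches, -1).astype(np.float64)

def contrast_normalize(X, offset=10.0):
    """Subtract each patch's mean and divide by sqrt(variance + offset)."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / np.sqrt(X.var(axis=1, keepdims=True) + offset)

def zca_whiten(X, eps=0.1):
    """Decorrelate the columns of X while staying close to the input space."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    eigvals, V = np.linalg.eigh(cov)  # cov = V diag(eigvals) V^T
    W = V @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ V.T
    return Xc @ W
```

A typical call chain would be `zca_whiten(contrast_normalize(extract_patches(images, m)))`, which yields one 64-dimensional row per patch.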

Because the $K$-means algorithm is highly dependent on initialization, different initial guesses of the centroids seriously affect the clustering result. To obtain a desirable clustering result, we propose a novel initialization method for choosing suitable initial stroke features. We introduce a dispersion metric based on the local information of the data, which guarantees that the initial centroids are selected from locally dense regions of the data and that the centroids lie far apart from each other.



… ($x^{(i)} \in \mathbb{R}^{64}$, $i = 1, \ldots, m$)

(1) $C \leftarrow \emptyset$
(2)–(5) … $= \sum^{m}$ … if …$^{(j)} > q$
(6) $C \leftarrow C \cup \{x^{(j)}\}$
(7) end if
(8) end for
(9) $F_0^{(1)}$ …
(10) $F_0^{(2)} = \arg\max_{x^{(j)}} \left\| x^{(j)} - F_0^{(1)} \right\|$, $\forall x^{(j)} \in C$
(11) set $k = 2$
(12) repeat
(13) $k = k + 1$
(14) $F_0^{(k)} = \arg\max_{x^{(j)}} \min_{t} \left\| x^{(j)} - F_0^{(t)} \right\|$, $\forall x^{(j)} \in C$, $t = 1, \ldots, k - 1$
(15) until $k = n_1$
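Steps (10)–(15) pick each new initial centroid as the candidate whose minimum distance to the already chosen centroids is largest. The sketch below illustrates that selection in NumPy. Several parts are assumptions rather than the authors' procedure: the candidate set $C$ is built by counting neighbours within a radius `eps` taken as the mean pairwise distance, the default threshold `q` is hypothetical, and the first centroid is taken to be the densest candidate.

```python
import numpy as np

def select_initial_centroids(X, n_centroids, q=None):
    """Pick initial centroids from dense regions, spread far apart.

    X is an (m, d) array of samples. The O(m^2) distance matrix makes
    this suitable for small m only.
    """
    m = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    eps = dist.sum() / (m * m)           # assumed: mean pairwise distance
    density = (dist < eps).sum(axis=1)   # neighbours within eps
    if q is None:
        q = np.median(density) - 1       # hypothetical default threshold
    cand = np.flatnonzero(density > q)   # candidate set C
    if cand.size < n_centroids:
        raise ValueError("too few candidates; lower q")
    # Assumed: the first centroid is the densest candidate.
    chosen = [cand[np.argmax(density[cand])]]
    # Distance from every candidate to its nearest chosen centroid.
    min_dist = dist[cand, chosen[0]].copy()
    while len(chosen) < n_centroids:
        nxt = cand[np.argmax(min_dist)]  # farthest from all chosen so far
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, dist[cand, nxt])
    return X[chosen]
```

Keeping a running `min_dist` vector avoids recomputing the inner minimum over all previous centroids at every iteration.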



$$w_{ij} = \begin{cases} 1 & \text{if } \left\| x^{(i)} - x^{(j)} \right\| < \varepsilon, \\ 0 & \text{otherwise,} \end{cases} \qquad \varepsilon = \sum_{i=1}^{m} \sum_{j=1}^{m} \operatorname{dis}\left(x^{(i)}, x^{(j)}\right) \cdots \tag{2}$$

… $= \sum_{j=1}^{m} w_{ij}$ ($i = 1, \ldots, m$) … $n_1$ is the number …




$$\min_{F, s^{(i)}} \sum_i \left\| F s^{(i)} - x^{(i)} \right\|^2 \tag{3}$$

… $s_j^{(i)} = F^{(j)\top} x^{(i)}$, or else $s_j^{(i)} = 0$. Then fix $s$ …

… $n_1$ gets larger …

… $n_1 = 256$ …







$$x^{(i)} = \frac{x^{(i)} - \operatorname{mean}(x^{(i)})}{\sqrt{10 + \operatorname{var}(x^{(i)})}}$$

$$x^{(i)} = V (D + 0.1 \times I)^{-1/2} V^{\top} x^{(i)}$$

Set $s_k^{(i)} = F^{(k)\top} x^{(i)}$ for $k = \arg\max_j \left| F^{(j)\top} x^{(i)} \right|$.

Set $s_j^{(i)} = 0$ for all other $j \neq k$.

Fix $s^{(i)}$: $\min_{F, s^{(i)}} \sum_i \left\| F s^{(i)} - x^{(i)} \right\|^2$ s.t. $\left\| s^{(i)} \right\|_1 = \left\| s^{(i)} \right\|_\infty$ and $\left\| F^{(j)} \right\|_2 = 1$.
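The alternation between the code assignment and the dictionary update can be sketched as below, assuming the patches are already normalized and whitened. The function names and the iteration count are hypothetical. The update shown, summing the code-weighted samples of each column and renormalizing to unit length, is one simple way to respect the unit-norm constraint and is not necessarily the authors' exact update.

```python
import numpy as np

def encode(X, F):
    """Keep, for each sample, only its largest-magnitude projection onto F."""
    proj = X @ F                               # (m, k): entries F^(j)T x^(i)
    k_star = np.argmax(np.abs(proj), axis=1)
    rows = np.arange(X.shape[0])
    S = np.zeros_like(proj)
    S[rows, k_star] = proj[rows, k_star]
    return S

def learn_features(X, F0, n_iter=10):
    """Alternate between assigning codes and updating unit-norm features.

    X:  (m, d) preprocessed patches, one per row.
    F0: (d, k) initial dictionary, one stroke feature per column.
    """
    F = F0 / np.maximum(np.linalg.norm(F0, axis=0, keepdims=True), 1e-12)
    for _ in range(n_iter):
        S = encode(X, F)
        F_new = X.T @ S                        # code-weighted sum per column
        norms = np.linalg.norm(F_new, axis=0, keepdims=True)
        used = norms.ravel() > 1e-12
        # Columns that won no sample keep their previous value.
        F[:, used] = F_new[:, used] / norms[:, used]
    return F
```

Because each code vector has at most one nonzero entry, its 1-norm equals its infinity-norm, which is the constraint on $s^{(i)}$ stated above.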




Figure: Classification accuracy (text/non-text), plotted over the range 0.62 to 0.82, for n = 96, n = 128, n = 256, and n = 320.










… $z_0 = x$. The … $z_1$ is a new … $s$ with …

Figure: Network structure (labels: input image, $z_0$ 32 × 32 × 1, $z_1$, 2 × 2 × 256, 256, $s$).

5 Experiments












… $P = \sum_{r_e \in E} m(r_e, T) / |E|$ and $R = \sum_{r_t \in T} m(r_t, E) / |T|$ …
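The two sums over the estimated rectangles $E$ and the target rectangles $T$ can be computed as in the sketch below. The definition of $m(r, R)$ is not legible in this section, so the code assumes the ICDAR 2003 convention [21]: $m(r, R)$ is the best match between $r$ and any rectangle in $R$, where the match of two rectangles is their intersection area divided by the area of the smallest box containing both. Rectangles are assumed to be `(x0, y0, x1, y1)` tuples, and all function names are illustrative.

```python
def area(r):
    """Area of a rectangle (x0, y0, x1, y1); empty rectangles give 0."""
    x0, y0, x1, y1 = r
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def match(a, b):
    """Intersection area over the area of the smallest box containing both."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    hull = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    hull_area = area(hull)
    return area(inter) / hull_area if hull_area > 0 else 0.0

def best_match(r, rects):
    """m(r, R): the best match of r against every rectangle in rects."""
    return max((match(r, other) for other in rects), default=0.0)

def precision_recall(estimates, targets):
    """Average best match of estimates against targets, and vice versa."""
    p = (sum(best_match(r, targets) for r in estimates) / len(estimates)
         if estimates else 0.0)
    r = (sum(best_match(t, estimates) for t in targets) / len(targets)
         if targets else 0.0)
    return p, r
```

With this soft match, a detection that only partly covers a ground-truth box contributes a fractional score instead of counting as a full hit or a full miss.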











6 Conclusion






References








































