
Generation of Synthetic Datasets for Performance Evaluation of Text/Graphics Document OCR

Mathieu Delalandre, CVC, Barcelona, Spain

DAG Meeting, CVC, Barcelona, Spain

Wednesday 19th of November 2008

Introduction

• Text/graphics documents
Text/graphics documents are used in a variety of fields like geography, engineering, social sciences, etc. Some examples are: architectural drawings, utility maps, geographic maps.

A huge amount of data exists, coming from two main sources: digitized documents (modern and old) and web images.

Introduction

• OCR of text/graphics documents

Character recognition systems working with text/graphics documents # first related work [Brown’1979] # more than 50 references on this topic today [Fletcher’1988] [Zenzo’1992] [Goto’1999] [Adam’2000] …

Problematics:
- general to any document: letter segmentation, multi-font recognition, scale variation
- specific to text/graphics documents: text/graphics separation, rotation variation, text-line detection, no reading order, no dictionary

[Processing pipeline: text/graphics separation produces the full image of text-lines; text-line detection produces images of single text-lines; character segmentation produces images of single characters; character recognition produces the ASCII output.]
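For readers who prefer code to a diagram, here is a minimal sketch of this processing chain as Python function stubs; the function names and the Image placeholder are illustrative assumptions, not the implementation of any of the systems cited above.

```python
from typing import List

class Image:
    """Placeholder for a raster document image (in practice, e.g. a numpy array)."""

def separate_text_graphics(document: Image) -> Image:
    """Text/graphics separation: keep only the textual layer (full image of text-lines)."""
    raise NotImplementedError

def detect_text_lines(text_layer: Image) -> List[Image]:
    """Text-line detection: one sub-image per single text-line."""
    raise NotImplementedError

def segment_characters(text_line: Image) -> List[Image]:
    """Character segmentation: one sub-image per single character."""
    raise NotImplementedError

def recognize_character(character: Image) -> str:
    """Character recognition: the ASCII label of one character image."""
    raise NotImplementedError

def ocr_text_graphics_document(document: Image) -> List[str]:
    """Chain the four stages; the output is one ASCII string per detected text-line."""
    text_layer = separate_text_graphics(document)
    lines = []
    for line_img in detect_text_lines(text_layer):
        chars = [recognize_character(c) for c in segment_characters(line_img)]
        lines.append("".join(chars))
    return lines
```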

Introduction

• About performance evaluation

Black-box evaluation: the evaluation considers the OCR system as an indivisible unit and evaluates it from its final results (i.e. OCR output vs. the ASCII transcription of the text, using string edit distances). White-box evaluation: the evaluation aims to characterize the performance of the individual sub-modules of the OCR system (skew correction, letter segmentation, block identification, character recognition, etc.).
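As a concrete illustration of the black-box protocol, the sketch below scores an OCR output against its ASCII transcription with a string edit (Levenshtein) distance; the character_error_rate normalisation is a common choice added here for illustration, not necessarily the exact metric of the reports cited below.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(ocr_output: str, transcription: str) -> float:
    """Black-box score: edit distance normalised by the groundtruth length."""
    return edit_distance(ocr_output, transcription) / max(1, len(transcription))

# Two substituted characters and one missing character -> 3 edits over 12 characters.
print(character_error_rate("Hel1o Wor1d", "Hello World!"))  # 0.25
```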

[Evaluation framework: the documents are processed by a groundtruthing step to produce the groundtruth, and by the system under evaluation to produce its results; the characterisation (performance evaluation) step then compares the results against the groundtruth.]

The case of general OCR: more than 40 references on the topic [Kanungo’1999] # several standard databases exist (NIST, MARS, CD-ROM English, …) # annual evaluation reports [Rice’1992] [Rice’1993].

The case of text/graphics document OCR [Wenyin’1997]: only 1 reference on the topic # no standard databases # no complete evaluation done through 20 years of research.

Introduction

• Scope of the proposed work

Performance evaluation of text/graphics document OCR # white-box evaluation # groundtruthing step # datasets for text-line detection and character recognition # the generation algorithms are “simple”, so the main purpose of the talk concerns the setting contributions

Plan

1. Groundtruth definition
2. Datasets for character recognition
3. Datasets for text-line detection
4. In progress datasets

Groundtruth definition

– Character level
• ASCII code
• font (name, size, style)
• location point
• oriented bounding box
• orientation (θ)
• scale

– Text level
• first location point
• groundtruth of characters
• characters/words positions

Example “Hello World”:
char   | H | e | l | l | o | W | o | r | l | d
p-word | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1
p-char | 0 | 1 | 2 | 3 | 4 | 0 | 1 | 2 | 3 | 4
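A minimal sketch of how one groundtruth record could be encoded is given below; the class and field names are illustrative assumptions, not the actual storage format of the datasets.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CharacterGT:
    """Character-level groundtruth, one record per character."""
    ascii_code: str                           # e.g. "H"
    font_name: str                            # e.g. "arial"
    font_size: int                            # in points
    font_style: str                           # e.g. "regular", "bold"
    location: Tuple[float, float]             # location point (x, y)
    oriented_bbox: List[Tuple[float, float]]  # 4 corners of the oriented bounding box
    orientation: float                        # rotation angle theta, in radians
    scale: float                              # scale factor

@dataclass
class TextGT:
    """Text-level groundtruth for one text-line."""
    first_location: Tuple[float, float]       # first location point
    characters: List[CharacterGT]             # groundtruth of the characters
    p_word: List[int]                         # word position of each character
    p_char: List[int]                         # position of each character in its word

# "Hello World" example from the table above (character records omitted for brevity).
hello_world = TextGT(
    first_location=(0.0, 0.0),
    characters=[],
    p_word=[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    p_char=[0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
)
```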


Datasets for character recognition (1/2)

• Problematics
How to generate single character images? Which number of classes? Which image resolution? Which size for the datasets? Which fonts? Etc.

• Published experiments

| reference | image size (1) | classes (2) | size (3) | learning (3) | font(s) (4) | rotation (5) | scaling (5) |
| Brown’1981 | 68² | ??/10 | 20 000 | × | × | yes | yes |
| Zenzo’1992 | ?? | ??/62 | 72 000 | × | × | yes | yes |
| Takahashi’1992 | 24² | ??/10 | 6 400 | 50% | × | yes | yes |
| Adam’2000 | 28² | 51/62 | 15 000 | 33% | × | yes | yes |
| Chen’2003 | 16²-512² | 26/26 | 1 000 | 14% | 1 | no | yes |
| Choisy’2004 | 28² | 51/62 | 15 000 | 80% | × | yes | yes |
| Hase’2004 | 32² | ??/26 | 3 000 | 33% | 3 | yes | no |
| Pal’2006 | 13²-34² | 40/62 | 18 000 | 80% | 2 | yes | yes |
| Roy’2008 | 13²-74² | 40/62 | 8 000 | 80% | many | yes | yes |

The column markers (1)-(5) refer to the main conclusions below.

• Main conclusions

(1) The real sizes of the characters can only be estimated.
(2) The confusion problem (e.g. 6 vs. 9) is still not well defined; the 62-class problem (a-z, A-Z, 0-9) is the main goal.
(3) It is not possible to fix a standard size for the training/test sets; this information is not well defined, but several thousand images are mandatory for training.
(4) The impact of fonts is little studied and should be taken into account in the evaluation.
(5) Invariance to rotation and scaling is the final goal; they are rarely studied independently.

Datasets for character recognition (2/2)

• Datasets

Geometry invariance:
| tests | scaling | rotation | font(s)/test | fonts | images |
| 3 | no | no | 1 | 3 | 15 000 |
| 3 | yes | no | 1 | 3 | 15 000 |
| 3 | no | yes | 1 | 3 | 15 000 |
| 3 | yes | yes | 1 | 3 | 15 000 |

Font adequacy (15 000 + 30 000 + 45 000 + 60 000 images):
| tests | scaling | rotation | font(s)/test | fonts | images |
| 4 | yes | yes | 3; 6; 9; 12 | 12 | 150 000 |

Font scalability:
| tests | scaling | rotation | font(s)/test | fonts | images |
| 30 | yes | yes | 1 | 30 | 150 000 |

• Generation setting

| letter class | 62: a-z; A-Z; 0-9 |
| font class | 30 fonts (http://www.codestyle.org/), lower and upper case, no cursive; basic fonts: 3 (times, courier, arial) |
| character size | 32² pixels (max dx×dy of the font symbols) |
| dataset size | 5 000 / font; 62 classes; 40 samples/class; 50%/50% |
| training | free; ranked files allow a training specification (e.g. 20% training on [file-4001 – file-5000]) |
| character scaling | 1.0 to 2.0, with a step of 1/1000 |
| character rotation | 0 to 2π, with a step of π/500 |

• Generation algorithm
Font manager, centering, scale and rotation processes; a sketch is given below.
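A minimal sketch of such a generation step is shown below, using the Pillow imaging library: render a glyph, centre it, and apply a scale and rotation drawn from the quantised ranges of the setting above. The font path, the Pillow calls and the margin/resizing choices are assumptions for illustration, not the author’s actual font manager.

```python
import math
import random
from PIL import Image, ImageDraw, ImageFont

def generate_character(char: str, font_path: str, base_size: int = 32) -> Image.Image:
    """Render one character, centre it, then apply a random scale and rotation."""
    # Scale in [1.0, 2.0] with a step of 1/1000, rotation in [0, 2*pi) with a step of pi/500.
    scale = 1.0 + random.randrange(1001) / 1000.0
    theta = random.randrange(1000) * math.pi / 500.0

    # Render the glyph centred on a white base_size x base_size canvas
    # (a small margin keeps ascenders/descenders inside the cell).
    canvas = Image.new("L", (base_size, base_size), 255)
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.truetype(font_path, base_size - 6)
    draw.text((base_size // 2, base_size // 2), char, fill=0, font=font, anchor="mm")

    # Apply the scale, then the rotation; expand=True grows the canvas so
    # the rotated glyph is not clipped.
    side = int(base_size * scale)
    scaled = canvas.resize((side, side))
    return scaled.rotate(math.degrees(theta), expand=True, fillcolor=255)

# Example usage (the font path is hypothetical):
# img = generate_character("A", "fonts/arial.ttf")
# img.save("A_0001.png")
```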

Datasets for text-line detection (1/2)

• Problematics
How to generate text-line images? Which number of words per image? Which image size? Which size for the datasets? Which number of fonts? Etc.

• Published experiments

| reference | use-case (1) | images (1) | text-lines (1) | curved (2) | font/img (3) | scaling (3) |
| Roy’2008 | geographic map | ?? | 5 000 | yes | many | yes |
| Pal’2004 | artistic document | ?? | 1 521 | yes | many | yes |
| Loo’2002 | poster, newspaper | 2 | 118 | yes | many | yes |
| Park’2001 | poster, publicity | 30 | 1 265 | yes | many | yes |
| Goto’1999 | Japanese form | 170 | 9 831 | yes | many | yes |
| Tan’1998 | map | 8 | 96 | no | many | yes |
| He’1996 | drawing | 1 | 16 | no | many | yes |
| Burge’1995 | cadastral map | 4 | 150 | no | many | yes |
| Deseilligny’1995 | cadastral map | 3 | 1 250 | no | many | yes |

The column markers (1)-(3) refer to the main conclusions below.

• Main conclusions

(1) The use-cases are heterogeneous and the sizes and resolutions of the images are rarely provided, so the text density is difficult to estimate; images with significant text content are preferred.
(2) Depending on the use-case, not all the methods work on curved text; a combination of curved and straight text is necessary.
(3) All the methods use context to extract the text-lines (i.e. font type, character size, line model). The size of the characters can vary a lot, and the number of fonts is generally small (less than ten).

Datasets for text-line detection (2/2)

• Datasets

Text-line density:
| test | text-lines/img | scaling | curved | font(s)/test | words |
| 1 | low | yes | no | 3 | in progress |
| 1 | medium | yes | no | 3 | in progress |
| 1 | high | yes | no | 3 | in progress |

Font context:
| test | text-lines/img | scaling | curved | font(s)/test | words |
| 1 | medium | no | no | 9 | in progress |
| 1 | medium | no | no | 6 | in progress |
| 1 | medium | no | no | 3 | in progress |
| 1 | medium | no | no | 1 | in progress |

Size context:
| test | text-lines/img | scaling | curved | font(s)/test | words |
| 1 | medium | no | no | 1 | in progress |
| 1 | medium | yes | no | 1 | in progress |

• Setting

| dictionary | 422 text-lines (countries and capitals) |
| font class | 30 fonts (http://www.codestyle.org/), lower and upper case, no cursive |
| character size | 32² pixels (max dx×dy of the font symbols) |
| image size | 640², 10-50 text-lines per image |
| dataset size | 100 images |
| text scaling | 1.0 to 1.5, with a step of 1/1000 |
| text rotation | -π/2 to +π/2, with a step of π/500 |

• Generation algorithm
The insert algorithm works in two steps. When the bounding box B2 of a newly inserted text-line overlaps the bounding box B1 of an already placed one, B1 ejects B2 by a displacement (dx, dy), with dx = d·cos θ, dy = d·sin θ, d = √(dx² + dy²) and l = l1 + l2 + l3. (The original slide illustrates this with the boxes B1, B2, the lengths l1, l2, l3 and the displacement dx, dy, d, θ.)
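A minimal sketch of the ejection step is given below: the new box is pushed along the direction θ, in small increments d decomposed into dx = d·cos θ and dy = d·sin θ, until the overlap disappears. The axis-aligned box representation, the step length and the stopping criterion are assumptions for illustration, not the author’s exact insert algorithm.

```python
import math
from dataclasses import dataclass

@dataclass
class Box:
    """Bounding box of a placed text-line (x, y is the top-left corner)."""
    x: float
    y: float
    w: float
    h: float

    def overlaps(self, other: "Box") -> bool:
        return (self.x < other.x + other.w and other.x < self.x + self.w and
                self.y < other.y + other.h and other.y < self.y + self.h)

def eject(b1: Box, b2: Box, theta: float, d: float = 1.0) -> Box:
    """B1 ejects B2: move b2 by (dx, dy) = (d*cos(theta), d*sin(theta))
    repeatedly until the two boxes no longer overlap."""
    dx, dy = d * math.cos(theta), d * math.sin(theta)
    while b1.overlaps(b2):
        b2 = Box(b2.x + dx, b2.y + dy, b2.w, b2.h)
    return b2

# Example: a newly inserted text-line collides with an existing one
# and is pushed away at 30 degrees.
placed = Box(100, 100, 200, 40)
new_line = Box(150, 110, 120, 40)
if placed.overlaps(new_line):
    new_line = eject(placed, new_line, theta=math.radians(30))
```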

In progress datasets

Conclusions

Conclusions # in progress work … # the character recognition datasets are ready # the bags of words are still under packaging, but will be ready soon

Perspectives # middle term: experimentations with standard feature extraction methods [Roy’2008] [Valveny’2007] # long term: experimentations with bags of words and text/graphics documents [Delalandre’2007] [Wenyin’1997]

References (1/2)
1. R. Brown, M. Lybanon and L.K. Gronmeyer. Recognition of Handprinted Characters for Automated Cartography: A Progress Report. Proceedings of the SPIE, vol. 205, 1979.
2. L.A. Fletcher and R. Kasturi. A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images. Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 10, pp. 910-918, 1988.
3. S.D. Zenzo, M.D. Buno, M. Meucci and A. Spirito. Optical recognition of hand-printed characters of any size, position, and orientation. IBM Journal of Research and Development, vol. 36, pp. 487-501, 1992.
4. H. Goto and H. Aso. Extracting curved text lines using local linearity of the text line. International Journal on Document Analysis and Recognition (IJDAR), vol. 2, pp. 111-119, 1999.
5. S. Adam, J.M. Ogier, C. Cariou, R. Mullot, J. Labiche and J. Gardes. Symbol and Character Recognition: Application to Engineering Drawings. International Journal on Document Analysis and Recognition (IJDAR), vol. 3, pp. 89-101, 2000.
6. T. Kanungo, G.A. Marton and O. Bulbu. Performance evaluation of two Arabic OCR products. Workshop on Advances in Computer-Assisted Recognition (AIPR), SPIE Proceedings, vol. 3584, pp. 76-83, 1999.
7. S.V. Rice, J. Kanai and T.A. Nartker. A Report on the Accuracy of OCR Devices. Information Science Research Institute, University of Nevada, USA, 1992.
8. S.V. Rice, J. Kanai and T.A. Nartker. An Evaluation of OCR Accuracy. Information Science Research Institute, University of Nevada, USA, 1993.
9. L. Wenyin and D. Dori. A Protocol for Performance Evaluation of Line Detection Algorithms. Machine Vision and Applications, vol. 9, pp. 240-250, 1997.
10. R.M. Brown. Handprinted Symbol Recognition System: A Very High Performance Approach To Pattern Analysis Of Free-form Symbols. Conference Southeastcon, pp. 5-8, 1981.
11. H. Takahashi. Neural network architectures for rotated character recognition. International Conference on Pattern Recognition (ICPR), pp. 623-626, 1992.
12. Q. Chen. Evaluation of OCR algorithms for images with different spatial resolutions and noises. School of Information Technology and Engineering, University of Ottawa, Canada, 2003.
13. C. Choisy, H. Cecotti and A. Belaid. Character Rotation Absorption Using a Dynamic Neural Network Topology: Comparison With Invariant Features. International Conference on Enterprise Information Systems (ICEIS), pp. 90-97, 2004.

References (2/2)
14. H. Hase, T. Shinokawa, S. Tokai and C.Y. Suen. A robust method of recognizing multi-font rotated characters. International Conference on Pattern Recognition (ICPR), vol. 2, pp. 363-366, 2004.
15. U. Pal, F. Kimura, K. Roy and T. Pal. Recognition of English Multi-oriented Characters. International Conference on Pattern Recognition (ICPR), vol. 2, pp. 873-876, 2006.
16. P.P. Roy, U. Pal and J. Llados. Multi-oriented character recognition from graphical documents. International Conference on Cognition and Recognition (ICCR), pp. 30-35, 2008.
17. U. Pal and P.P. Roy. Multi-oriented and curved text lines extraction from Indian documents. IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 34, pp. 1676-1684, 2004.
18. P.K. Loo and C.L. Tan. Word and Sentence Extraction Using Irregular Pyramid. Workshop on Document Analysis Systems (DAS), Lecture Notes in Computer Science (LNCS), vol. 2423, pp. 307-318, 2002.
19. H.C. Park, S.Y. Ok, Y.J. Yu and H.G. Cho. Word Extraction in Text/Graphic Mixed Image Using 3-Dimensional Graph Model. International Journal on Document Analysis and Recognition (IJDAR), vol. 4, pp. 115-130, 2001.
20. H. Goto and H. Aso. Extracting curved text lines using local linearity of the text line. International Journal on Document Analysis and Recognition (IJDAR), vol. 2, pp. 111-119, 1999.
21. C.L. Tan and P.O. Ng. Text extraction using pyramid. Pattern Recognition (PR), vol. 31, pp. 63-72, 1998.
22. S. He, N. Abe and C.L. Tan. A clustering-based approach to the separation of text strings from mixed text/graphics documents. International Conference on Pattern Recognition (ICPR), pp. 706-710, 1996.
23. M. Burge and G. Monagan. Extracting Words and Multi Part Symbols in Graphics Rich Documents. International Conference on Image Analysis and Processing (ICIAP), 1995.
24. M. Deseilligny, H. Le Men and G. Stamon. Characters string recognition on maps, a method for high level reconstruction. International Conference on Document Analysis and Recognition (ICDAR), pp. 249-252, 1995.
25. E. Valveny, S. Tabbone, O. Ramos and E. Philippot. Performance Characterization of Shape Descriptors for Symbol Representation. Workshop on Graphics Recognition (GREC), 2007.
26. M. Delalandre, T. Pridmore, E. Valveny, E. Trupin and H. Locteau. Building Synthetic Graphical Documents for Performance Evaluation. Workshop on Graphics Recognition (GREC), Lecture Notes in Computer Science (LNCS), vol. 5046, pp. 288-298, 2008.