
  • INFTY: An integrated OCR system for mathematical documents

    Masakazu Suzuki 1), Fumikazu Tamari 2), Ryoji Fukuda 3), Seiichi Uchida 4), Toshihiro Kanahori 5)

    1) Faculty of Mathematics, Kyushu University, Japan

    2) Department of Information Education, Fukuoka University of Education, Japan

    3) Department of Human Welfare Engineering, Oita University, Japan

    4) Faculty of Information Science and Electrical Engineering, Kyushu University, Japan

    5) Research Center on Educational Media, Tsukuba College of Technology, Japan

    [email protected]

    ABSTRACT

    An integrated OCR system for mathematical documents, called INFTY, is presented. INFTY consists of four procedures, i.e., layout analysis, character recognition, structure analysis of mathematical expressions, and manual error correction. In those procedures, several novel techniques are utilized for better recognition performance. Experimental results on about 500 pages of mathematical documents showed high character recognition rates on both mathematical expressions and ordinary texts, and sufficient performance on the structure analysis of the mathematical expressions.

    Keywords

    Mathematical OCR, character and symbol recognition, structure analysis of mathematical expressions

    1. INTRODUCTION

    Optical character reader (OCR) systems that can recognize not only ordinary texts but also mathematical expressions have been investigated [1]. The development of such OCR systems provides the following merits.

    Storage size reduction: The OCR result of a document requires far less storage than its original scanned image.

    Search services: Various search services (e.g., keyword search, definition search, and theorem search) become available across mathematical documents.

    Format conversion: The OCR result can be provided in various document formats (e.g., XML, LaTeX, Mathematica notebook, and braille).

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    DocEng '03, Grenoble, France. Copyright 2003 ACM X-XXXXX-XX-X/XX/XX ...$5.00.

    In particular, OCR for mathematical documents is indispensable for digitizing the numerous historical mathematical documents for digital libraries [2, 3].

    In this paper, an integrated OCR system for mathematical documents, called INFTY, is presented. Figure 1 shows a snapshot of INFTY on a PC. INFTY reads scanned page images of a mathematical document and provides their character recognition results. Since INFTY analyzes the structures of mathematical expressions in the document, it can produce its recognition result in the LaTeX format (Fig. 2) and other math-description formats. Figure 3 shows the diagram of INFTY, which consists of four procedures, i.e., ① layout analysis, ② character recognition, ③ structure analysis of mathematical expressions, and ④ manual error correction.

    Novel and distinctive features of INFTY are summarized as follows.

    The character recognition procedure of INFTY consists of two independent and complementary recognition engines; one is a commercial OCR engine not specialized for mathematical documents and the other is a character recognition engine originally developed for mathematical symbols.

    The separation of ordinary text parts and mathematical expression parts is performed in the character recognition procedure while utilizing recognition results.

    The structure analysis procedure is based on an optimization framework and is therefore stable against both recognition errors and ambiguity in the mathematical expressions.

    A clustering technique is incorporated for higher accuracy and efficiency.

    The rest of this paper is organized as follows. Sections 2, 3, 4, and 5 describe the details of the above four procedures (Fig. 3 ① to ④), respectively. In those descriptions, the merits of the above features are emphasized. Then, in Section 6, the performance of INFTY is evaluated qualitatively and quantitatively through experimental results on about 500 pages of mathematical documents.

  • Figure 1: Snapshot of INFTY. (Frames shown: thumbnails frame, scanned image frame, recognition result frame.)

    the functions $\sigma_a(r)={\displaystyle \int}_{ ||z-a||\leq r} \sigma$ and $\nu_a(r)={\displaystyle \int}_{||z-a||\leq r}\nu_a$. Both are positive increasing functions of $r$. Then

    Figure 2: (Upper) Input to INFTY. (Lower) Output of INFTY in LaTeX format.

    2. LAYOUT ANALYSIS

    In the layout analysis procedure (Fig. 3 ①), which is the first procedure of INFTY, several preprocessing operations, such as binarization, noise removal, and deskewing, are performed on the page images (scanned at 600 dpi) of a mathematical document.

    After all connected components are extracted from the preprocessed page image, the page image is separated into figure/table areas and non-figure areas. One of the main criteria used in this separation is the size of the connected components. For example, an area with large connected components will be judged as a figure/table area. Note that big symbols, such as root symbols and big parentheses, are ignored in this separation process through special treatment.
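    The size criterion above can be illustrated with a minimal sketch. It assumes a binarized page given as a list of rows of 0/1 pixels; the function names and the thresholds `max_height` and `max_width` are hypothetical, and the special treatment of big symbols mentioned above is not modeled.

```python
from collections import deque


def connected_components(img):
    """Return bounding boxes (top, left, bottom, right) of the 8-connected
    foreground regions of a binary image given as a list of rows of 0/1."""
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def split_by_size(boxes, max_height, max_width):
    """Hypothetical size rule: a component exceeding either threshold
    is treated as a figure/table candidate, the rest as text candidates."""
    text, figure = [], []
    for box in boxes:
        top, left, bottom, right = box
        height, width = bottom - top + 1, right - left + 1
        if height > max_height or width > max_width:
            figure.append(box)
        else:
            text.append(box)
    return text, figure
```

    In practice the thresholds would depend on the scan resolution and the font size of the document; the values are left to the caller here because the text does not state them.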

    The non-figure area is further decomposed into text lines. In non-mathematical documents, each text line is simply extracted by searching for the periodic local minima of the horizontal projection histogram of the page image. In mathematical documents, however, this strategy is not effective; the heights of mathematical expressions vary widely, and therefore the horizontal projection histogram is often irregular around mathematical expressions. Our strategy is similar to that of Kacem et al. [4], where connected components in a certain neighborhood are concatenated to build a text line.
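    The two line-extraction strategies can be contrasted in a short sketch. It is not the method of INFTY or of Kacem et al.; it is a simple greedy grouping written under stated assumptions: components are bounding boxes `(top, left, bottom, right)`, and the neighborhood is defined by vertical overlap plus a hypothetical horizontal gap limit `max_gap`.

```python
def horizontal_projection(img):
    """Number of foreground pixels in each row of a 0/1 image.
    On plain text, rows between lines have near-zero counts."""
    return [sum(row) for row in img]


def group_into_lines(boxes, max_gap):
    """Greedily concatenate neighboring boxes into text lines.

    A box joins an existing line when it overlaps the line's vertical
    extent and starts at most max_gap pixels to the right of the line's
    right edge. Otherwise it opens a new line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):  # left to right
        top, left, bottom, right = box
        for line in lines:
            ltop, lleft, lbottom, lright = line["box"]
            overlaps = top <= lbottom and bottom >= ltop
            close = left - lright <= max_gap
            if overlaps and close:
                line["box"] = [min(ltop, top), min(lleft, left),
                               max(lbottom, bottom), max(lright, right)]
                line["members"].append(box)
                break
        else:
            lines.append({"box": list(box), "members": [box]})
    return sorted(lines, key=lambda l: l["box"][0])  # top to bottom
```

    A tall component such as a fraction or an integral sign simply enlarges the vertical extent of its line, which is the situation where a fixed projection-minimum rule tends to break down.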

    3. CHARACTER RECOGNITION

    The character recognition procedure (Fig. 3 ②), which is the second procedure in INFTY, plays two important roles. The first role is the separation of each text line into mathematical expressions (e.g., 2, P (a) = a p(x)dx) and ordinary texts (e.g., Theorem, defined). The second role is the character recognition for both ordinary texts (e.g.,

  • Figure 3: Diagram of INFTY. INFTY consists of four procedures (surrounded by dashed lines), i.e., ① layout analysis, ② character recognition, ③ structure analysis of mathematical expressions, and ④ manual error correction.

    Diagram labels: scanned mathematical document image; ① preprocessing, extraction of connected components, separation of figure/table areas, extraction of lines in non-figure areas (Section 2); ② initial character recognition and math-text separation (Section 3.1), automatic correction of recognition results using clustering (Section 3.2); ③ structure analysis of math expressions (Section 4); ④ manual error correction (Section 5); output in XML, LaTeX, braille, etc.

    a, A) and mathematical expressions (e.g., a, A, , , ().

    As shown in Fig. 3 ②, the character recognition procedure for those two roles consists of two sub-procedures, i.e., (i) initial character recognition and math-text separation and (ii) automatic correction of recognition results using clustering. The first sub-procedure provides math-text separation results as well as initial character recognition results. The second sub-procedure improves the recognition accuracy by reducing misrecognitions due to slight shape differences. In the following, the details of each sub-procedure are described.
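    The idea behind the second sub-procedure can be illustrated with a toy sketch. This section does not give INFTY's actual clustering method, so everything below is an assumption: glyphs are equal-length tuples of 0/1 pixels, similarity is Hamming distance with a hypothetical threshold `max_dist`, clusters are built by a simple leader scheme, and each cluster's majority label overrides the minority labels.

```python
from collections import Counter


def hamming(a, b):
    """Number of differing pixels between two equal-length glyph tuples."""
    return sum(x != y for x, y in zip(a, b))


def cluster_and_vote(glyphs, labels, max_dist):
    """Group similar glyphs, then relabel each group by majority vote.

    glyphs: list of equal-length tuples of 0/1 pixels (assumed format)
    labels: initial recognition result for each glyph
    """
    clusters = []  # list of (leader glyph, member indices)
    for i, glyph in enumerate(glyphs):
        for leader, members in clusters:
            if hamming(leader, glyph) <= max_dist:
                members.append(i)
                break
        else:
            clusters.append((glyph, [i]))

    corrected = list(labels)
    for _, members in clusters:
        winner, _count = Counter(labels[i] for i in members).most_common(1)[0]
        for i in members:
            corrected[i] = winner
    return corrected
```

    The sketch only shows why clustering can help: glyphs printed in the same font with the same shape should share one label, so an isolated disagreement inside a cluster is likely a misrecognition.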

    3.1 Initial character recognition and math-text separation

    Figure 4 illustrates the details of the sub-procedure for initial character recognition and math-text separation. This sub-procedure has two features. One feature is that the math-text separation is performed while utilizing the result of character recognition. That is, the character recognition and the math-text separation are performed simultaneously and cooperatively. The other feature is that two complementary recognition engines, a commercial OCR engine for ordinary texts and an original recognition engine for mathematical expressions, are used in a two-step manner.

    (a) Recognition by commercial OCR engine

    The connected components on a text line are first passed to a commercial OCR engine. When the text line contains only ordinary text, this OCR engine will produce a good recognition result by utilizing a rich lexicon. However, when the text line contains mathematical expressions, the OCR engine will fail (due to italic fonts, sub-/superscripts, mathematical symbols, etc.) and may produce a meaningless string (e.g., y?zs in Fig. 4) as the recognition result. In INFTY, this failure is exploited for initial math-text separation. Namely, the connected components recognized as such a meaningless string are selected as the connected components of mathematical expressions.
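    A minimal sketch of this idea follows. How INFTY decides that a string is meaningless is not specified here, so the test below is an assumption: a token counts as ordinary text only if, after stripping surrounding punctuation, it is purely alphabetic and found in a lexicon. The function names and the tiny lexicon are hypothetical.

```python
import string


def is_meaningful(token, lexicon):
    """Assumed rule: alphabetic word (punctuation stripped) found in the lexicon."""
    word = token.strip(string.punctuation).lower()
    return word.isalpha() and word in lexicon


def initial_math_text_separation(tokens, lexicon):
    """Tag each token from the general-purpose OCR engine as ordinary
    text or as a math candidate to be re-examined by a second engine."""
    return [(tok, "text" if is_meaningful(tok, lexicon) else "math?")
            for tok in tokens]


# Example with the meaningless string from Fig. 4.
lexicon = {"the", "function", "is", "defined", "by"}
tokens = ["The", "function", "y?zs", "is", "defined."]
tagged = initial_math_text_separation(tokens, lexicon)
```

    With this rule only `y?zs` is tagged as a math candidate. Single italic variables that happen to be dictionary words (such as "a") would slip through, which is one reason a further verification step is needed.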

    (b) Verification based on position and