

Document Analysis: Introduction

Prof. Rolf Ingold, University of Fribourg

Master course, spring semester 2008

© Prof. Rolf Ingold


Outline

- Introduction: definition and aims
- Applications overview
- Methodologies
- Possibilities & limits
- Experience of the DIVA research group
- Course content and structure



What is a document?

Data = abstract binary representation of any kind of information to be stored, transmitted or processed by computers

Information = data associated with an implicit or explicit interpretation

Document = piece of information that can be perceived and interpreted by humans

To be perceived, documents have to be rendered:
- displayed
- projected on screens
- printed
- played on speakers
- …
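The distinction between data and information can be illustrated with a small sketch. The four byte values below are an arbitrary assumption chosen for illustration; the point is only that the same data yields different information depending on the interpretation applied to it.

```python
import struct

# Four bytes of raw data; on their own they carry no meaning.
data = bytes([0x44, 0x49, 0x56, 0x41])

# Interpretation 1: ASCII-encoded text.
as_text = data.decode("ascii")

# Interpretation 2: one big-endian unsigned 32-bit integer.
(as_integer,) = struct.unpack(">I", data)

# Interpretation 3: grey levels of a 2x2 image, stored row by row.
as_pixels = [list(data[0:2]), list(data[2:4])]

print(as_text)    # DIVA
print(as_integer)
print(as_pixels)  # [[68, 73], [86, 65]]
```

Rendering then turns one of these interpretations into something a human can perceive, for instance by displaying the text or the image.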



Taxonomy of documents

Documents may be
- synthetic (structured) or captured (unstructured)
- static (non-temporal, printable) or dynamic (temporal)
- viewable, audible or tactile

[Figure: grid crossing synthetic vs. captured data with static vs. dynamic documents; entries shown: graphics, text (printed), images, off-line handwriting, on-line handwriting, animation, speech (synthetic), video, audio]



What is document analysis?

Document analysis aims at extracting symbolic information
- text (words, expressions, continuous text)
- graphics (vector graphics, shapes, symbols)
- layout structures
- logical structures
- numeric data
- writer / speaker identities, ...

from different captured sources
- images (scanned, camera based, synthesized)
- video
- on-line handwriting
- sound



Importance of document structures

Document = Content + Structures

Structures convey abstract high level information

They are revealed by styles



Structural document analysis

<DOCUMENT id=… >
  <BODY>
    <TITLE>A Master / Slave Monitor … Network </TITLE>
    <AUTHORS>
      <AUTH>D.Jacobson</AUTH>
      …
      <AUTH>M. Shafiq</AUTH>
    </AUTHORS>
    <ABSTRACT><P>M…

Document analysis = Image Analysis of static documents to extract content and structures

Document analysis is applicable to
- captured images (from scanner, camera)
- synthetic images of electronic documents, available in unstructured or poorly structured form



Analysis of Electronic Documents

- Most electronic documents are unstructured or poorly structured
- Document understanding can be seen as a reverse-engineering task using a fixed-layout document format (such as PDF or XPS) as a pivot format




Visual Audio Processing Chain

Visual Audio aims at recovering sound from old records by image analysis



Usefulness of document analysis

Extracting information from captured documents is useful in different contexts
- to avoid cumbersome keyboarding
- to capture information remotely
- to study the document’s content
- to categorize, classify and index digitized documents
  - for digital libraries
  - culture preservation
- to reuse document chunks
- to reedit and restyle an existing document
- to extract information for integrated applications
  - office automation
  - database management
  - information systems
- to perform multimodal alignment



Typical applications of document analysis

Commercial products are available for
- text reading (OCR products)
- office automation (mail reading and dispatching)
- form processing (for dedicated applications)

More specialized products
- postal address reading
- check reading and processing



Form processing

Performance of form processing depends
- on form complexity
- on form variability

Fields are located easily
- if their positions are fixed
- when using different colors

Content recognition is hard for several reasons
- degraded images
- approximate positioning of symbols
- variability of handwriting



Check reading

Check reading can be automated at >90%
- difficulties: textured background, variability of writing
- facilitating factors: fixed vocabulary, redundancy (legal & courtesy amount), availability of contextual information (client database)

[Figure: sample check with labeled fields: legal amount, payee name, MICR, date, courtesy amount, signature; from <www.a2ia.com>]
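The redundancy between the legal amount (written in words) and the courtesy amount (written in digits) can serve as a consistency check on the recognition result. The sketch below is only an illustration under simplifying assumptions: English number words, whole amounts below one million, no cents. The function names are hypothetical and do not describe any commercial check reader.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(text):
    """Convert a legal amount in English words to an integer (< 1,000,000)."""
    total, current = 0, 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word == "thousand":
            total += current * 1000
            current = 0
        else:
            # The fixed vocabulary makes unknown words easy to reject.
            raise ValueError(f"unknown word: {word}")
    return total + current


def amounts_agree(legal_text, courtesy_digits):
    """True if the recognized legal and courtesy amounts are consistent."""
    return words_to_number(legal_text) == int(courtesy_digits)


print(amounts_agree("one thousand two hundred thirty-four", "1234"))  # True
print(amounts_agree("seven hundred and fifty", "150"))                # False
```

When the two readings disagree, a system could reject the check or route it to a human operator rather than trust either recognizer alone.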



Table of contents recognition

Aim
- to extract information from TOC to index journals
- to associate titles and authors with page numbers

Advantages
- very precise goal
- regular layout for a given journal

Difficulties
- complex layout
- great variability when considering journals universally



Analysis of historical documents

Aim
- to extract information to index historical documents

Challenges
- degradations
- irregular layout
- rich typography, ornaments
- old scripts (no OCR)

Possible approach
- word spotting



Logical & physical document structures

Logical document structures
- reflect the author’s point of view
- independent of presentation
- composed of application-dependent logical entities
  - chapters, sections
- specific to the application and document class

Physical document structures
- reflect the editor’s point of view
- composed of a hierarchy of physical entities
  - text blocks, text lines and tokens
  - graphical primitives
- universal and independent of the document class



Document processing cycle

[Figure: document processing cycle; stages: logical document, physical document, paper document, document image; transitions: formatting, rendering, printing, digitizing, analysis and recognition]

Document analysis can be considered as the reverse of formatting



Relation between logical and physical structure

[Figure: logical structure, styles and physical structure, linked by formatting and analysis; further labels: edit, print, display]

- Document formatting is straightforward ...
- But document analysis is a nontrivial task that generally cannot be fully automated



Processing chain

[Figure: processing chain; labels: image, preprocessing, segmentation, blocks, OCR, simple text, OFR, fonts, layout analysis, document understanding, post-analysis, structured document]



Pre-processing

Pre-processing aims at preparing the document image for further analysis; it includes
- brightness / contrast enhancement
- noise removal
- skew / aberration correction
- binarization / color clustering
- shape smoothing
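As an illustration of the binarization step, the sketch below implements Otsu's global thresholding in plain Python. The slides do not name a specific binarization method, so the choice of Otsu's method is an assumption, as are the function names and the tiny sample image.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t maximizing the between-class variance.

    Pixels <= t form the dark class, pixels > t the bright class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # sum of grey levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break     # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map dark pixels (ink) to 1 and bright pixels (background) to 0."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Tiny grey-level image: dark ink (about 10) on bright paper (about 200).
image = [[10, 12, 200],
         [11, 210, 205]]
threshold = otsu_threshold([p for row in image for p in row])
print(binarize(image, threshold))  # [[1, 1, 0], [1, 0, 0]]
```

A single global threshold is usually too crude for textured backgrounds or degraded documents, where locally adaptive thresholds tend to be preferred.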



Segmentation

Document segmentation aims at splitting the image into regions of interest; it includes
- page segmentation into blocks
- text, graphics and images separation
- hairlines and frames detection
- text block segmentation into text lines, words and characters
- in form processing, field separation
- graphics segmentation into vectors and symbols
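Splitting a text block into text lines can be sketched with a horizontal projection profile: count the ink pixels in each row and cut wherever the count drops to zero. This is one classical technique, shown here as an assumption since the slides do not prescribe a method; it presumes a binarized, deskewed block in which 1 denotes ink.

```python
def segment_lines(binary_image):
    """Return inclusive (top, bottom) row ranges of the text lines."""
    profile = [sum(row) for row in binary_image]  # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                      # a text line begins
        elif ink == 0 and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


block = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 1],
]
print(segment_lines(block))  # [(1, 2), (4, 4), (6, 6)]
```

The same idea applied to columns of a single line gives a first approximation of word and character boundaries, although touching characters and skewed lines break it quickly.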



Optical Character Recognition (OCR)

OCR aims at extracting character codes (ASCII) from text images

OCR was one of the earliest computer vision applications
- early patents were filed in the 1910s, 30 years before the computer age!

OCR deals with many situations
- isolated characters vs. complete words or phrases
- different character classes (digits, uppercase letters, full text, …)
- restricted or open vocabulary
- machine-printed vs. handwritten text
- different languages (with various diacritics) and different scripts (Latin, Greek, Hebrew, Arabic, Farsi, various Asian scripts, …)
- imperfect image quality (low resolution, textured background, distortions, noise, …)
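The simplest of the situations listed above, isolated machine-printed characters from a small class set, can be sketched as nearest-template matching. This is a deliberately naive toy and not how production OCR engines work; the 3x5 bitmaps, the three-letter alphabet and the function names are assumptions made for illustration.

```python
# Hypothetical 3x5 binary templates for a three-letter alphabet.
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def hamming(a, b):
    """Number of pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(glyph):
    """Return the label of the template closest to the glyph."""
    return min(TEMPLATES, key=lambda label: hamming(glyph, TEMPLATES[label]))


# A "T" with one noisy pixel in the fourth row.
noisy_t = ["111", "010", "010", "110", "010"]
print(classify(noisy_t))  # T
```

Such a matcher fails as soon as fonts, sizes or image quality vary, which is why the remaining situations in the list call for trained classifiers and contextual knowledge such as vocabularies.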



Text recognition related problems

Text analysis must also consider other aspects

In case of printed text
- font recognition (family, size and style)
- font categorization (with/without serifs, fixed vs. proportional font)

In case of handwritten text
- writer identification or verification
- writer classification



Layout analysis

Layout analysis aims at extracting physical structures of documents; it consists of
- locating, delimiting and identifying
  - text blocks
  - graphics
  - tables
  - formulas
  - handwritten text fields
  - annotations
- associating figures and captions
- locating and delimiting headers and footers
- recovering the reading order (of multicolumn documents)
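Recovering the reading order of a multicolumn page can be sketched once the text blocks have been located. The example below assumes a simple layout in which blocks of the same column share roughly the same left edge and no block spans several columns; the `Block` type, the tolerance value and the coordinates are hypothetical.

```python
from typing import NamedTuple


class Block(NamedTuple):
    name: str
    x: int  # left edge
    y: int  # top edge


def reading_order(blocks, column_tolerance=20):
    """Order blocks column by column (left to right), top to bottom."""
    columns = []  # each column is a list of blocks with similar left edges
    for block in sorted(blocks, key=lambda b: b.x):
        if columns and abs(columns[-1][0].x - block.x) <= column_tolerance:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y))
    return [b.name for b in ordered]


page = [
    Block("right-top", 320, 100),
    Block("left-bottom", 50, 400),
    Block("left-top", 55, 100),
    Block("right-bottom", 318, 420),
]
print(reading_order(page))
# ['left-top', 'left-bottom', 'right-top', 'right-bottom']
```

Real pages with titles spanning all columns, inserted figures or irregular layouts need a richer model than this left-edge clustering.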



Example: layout modeling of scientific journals



Optical Font Recognition (OFR)

OFR aims at identifying the fonts used

OFR is useful
- for improving OCR accuracy, by using dedicated classifiers to distinguish “O” and “0”, “I” and “1”, …
- for assigning logical labels, for logical structure recognition

Two strategies may be applied for OFR
- a priori OFR (without considering the content)
- a posteriori OFR (when the content is supposed to be known)



Document structure recognition

Document structure recognition (also referred to as document understanding) is the first step towards document interpretation

Document understanding deals with
- logical labeling
- logical structure recognition

Two levels of granularity are considered
- macro-structure analysis: labeling paragraphs / blocks
- micro-structure analysis: labeling words / strings

Document structure recognition is still considered an open issue
- there is no universal approach
- solutions exist for dedicated document classes (museum notices, checks, tables of contents, scientific papers, newspapers, …)



Two Levels of Structural Document Analysis

Physical structure analysis (also layout analysis)
- to locate and identify text blocks, graphics, tables, formulas, handwritten text fields, annotations, …
- to recover the reading order

Logical structure analysis (also document understanding)
- to assign a hierarchy of logical labels
- first step towards interpretation
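The step from physical to logical structure can be sketched as rule-based logical labeling, where style attributes of each physical block decide its label. The rules, thresholds, attribute names and sample blocks below are assumptions for one imaginary document class; they are meant to show why such knowledge is class-specific, not to describe a general method.

```python
def label_block(block, body_font_size=10.0):
    """Assign a logical label to one physical text block from style cues."""
    if block["font_size"] >= 1.6 * body_font_size:
        return "title"
    if block["bold"] and block["font_size"] > body_font_size:
        return "heading"
    if block["italic"] and block["below_figure"]:
        return "caption"
    return "paragraph"


# Hypothetical output of physical structure analysis for one page.
page = [
    {"text": "Sample title", "font_size": 18.0, "bold": True,
     "italic": False, "below_figure": False},
    {"text": "1. Section", "font_size": 12.0, "bold": True,
     "italic": False, "below_figure": False},
    {"text": "Body text ...", "font_size": 10.0, "bold": False,
     "italic": False, "below_figure": False},
    {"text": "Fig. 1: ...", "font_size": 9.0, "bold": False,
     "italic": True, "below_figure": True},
]
print([label_block(b) for b in page])
# ['title', 'heading', 'paragraph', 'caption']
```

Hard-coded rules like these stop working as soon as the journal or publisher changes its styles, so they have to be rewritten or relearned for each document class.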



Use Case: Intelligent Newspaper Indexing

Full text indexing is not adequate for complex documents

The following items have to be identified
- headlines
- editorial articles (with title, author & function, summary, content, links, ...)
- captions (associated with images)
- reader’s letters
- advertisements
- ...



Use case: Understanding Museum Notices

[Figure: annotated museum notices with labeled zones: Group Vedette; Area Title (Principal Title, End of the title); Area Address / Date (Address, Date); Area Collection; Group Cote; from A. Belaïd, LORIA-CNRS Nancy]



Possibilities and limits of DA

Layout analysis is considered as almost solved for printed documents
- it can be achieved generically
- problems remain for textured backgrounds and degraded documents (historical & handwritten documents)

Document understanding is much less mature
- solutions are application dependent
- application of specific knowledge is needed (document models)



Need for Document Recognition Models

There is no universal approach!

Document recognition systems must be tuned
- for specific applications
- for specific document classes

Contextual information is required

Models provide information like
- generic document structures (DTD or XML-schema)
- geometrical and typographical attributes (style information)
- semantic information (keywords, dictionaries, databases, ...)
- statistical information



Content of document models

Generic structure
- Document Type Definition (DTD) or XML-schema

Style information
- absolute or relative positioning
- typographical attributes & formatting rules

Semantics (if available)
- linguistic information, keywords
- application specific ontology

Probabilistic information
- frequencies of items or sequences, co-occurrences



Trouble with document models

Document models are hard to produce and to maintain
- implicit models (hard coded in the application) => hard to modify, adapt, extend
- explicit models, written in a formal language => cumbersome to produce, need high expertise
- abstract models, learned automatically => need a lot of training data (with ground-truth!)

Need for more flexible tools:
- assisted environments with friendly user interfaces
- recognition improving with use
- models are learned incrementally



Pattern Based Document Understanding (2-CREM) [Robaday 03]

Configurations consist of
- a set of vertices
  - labeled (type)
  - attributed (pos, typo, ...)
- edges between vertices
  - labeled (neighborhood relation)
  - attributed (geom, ...)

Model consists of
- extraction rules
- for each class
  - attribute selector
  - list of patterns

[Figure: 2-CREM processing diagram; labels: document image, extraction, configuration, classification, model, rules, patterns, selector, id]



Performance evaluation

Performance evaluation is an important issue
- to compare algorithms
- to estimate correction costs of real applications

Ground-truthed databases are required
- cost reduction by document analysis tools (bootstrap)
- synthetic data as alternative
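One common way to compare text recognition output against ground truth is the edit (Levenshtein) distance, which also gives a rough estimate of correction cost as the number of character edits a human would have to make. The slides do not specify a metric, so this choice and the function names are assumptions.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1,     # deletion
                               current[j - 1] + 1,  # insertion
                               substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the ground truth."""
    return edit_distance(reference, hypothesis) / len(reference)


# Typical OCR confusion: "m" recognized as "rn".
print(edit_distance("document", "docurnent"))         # 2
print(character_error_rate("document", "docurnent"))  # 0.25
```

Averaged over a ground-truthed database, such a rate lets two algorithms be compared on the same data.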



List of Lessons

1. Introduction to document analysis and recognition

2. Document image processing

3. Fundamentals of pattern recognition I

4. Fundamentals of pattern recognition II

5. Printed text recognition

6. Font recognition

7. Layout analysis and segmentation

8. Logical structure analysis

9. Graphics recognition

10. Handwriting recognition

11. Reverse engineering of documents

12. Multimodal applications



Conclusion on document analysis

Document analysis is useful for many applications
- commercial systems solve some of them

Advanced document analysis prototypes are developed in many research labs around the world

No universal document analysis system is in sight

User assisted approaches may be a good trade-off for midsize applications

Structural document analysis will not disappear with exclusive electronic document handling (paperless office)



Organization of the course

Professor: Rolf Ingold, <[email protected]>, Pérolles-2, B421, 026 300 84 66

A: Jean-Luc Bloechle, <[email protected]>, Pérolles-2, B440, 026 300 92 94

Course: Tuesday, 09:15-10:00 & 10:15-11:00

Exercise: Wednesday, 11:15-12:00
- requirements: 2/3 of series returned, 1/2 considered satisfactory

Homework: estimated at 4-6 hours a week

Website: http://diuf.unifr.ch/diva/web/

Examination:
- oral, 20 minutes (alternatively written, 120 min)
- after spring semester (June 2008) or summer (August-September 2008)

Credits: 5 ECTS
