Document Analysis: Introduction
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
© Prof. Rolf Ingold
Outline
Introduction: definition and aims
Applications overview
Methodologies
Possibilities & limits
Experience of the DIVA research group
Course content and structure
What is a document?
Data = abstract binary representation of any kind of information to be stored, transmitted or processed by computers
Information = data associated with an implicit or explicit interpretation
Document = piece of information that can be perceived and interpreted by humans
To be perceived, documents have to be rendered
  displayed
  projected on screens
  printed
  played on speakers
  …
Taxonomy of documents
Documents may be
  Synthetic (structured) or captured (unstructured)
  Static (non-temporal, printable) or dynamic (temporal)
  Viewable, audible or tactile
[Figure: taxonomy grid crossing synthetic vs. captured data with static vs. dynamic documents; entries include text (printed), graphics, images, off-line handwriting, on-line handwriting, animation, video, audio and speech (synthetic)]
What is document analysis?
Document analysis aims at extracting symbolic information
  text (words, expressions, continuous text)
  graphics (vector graphics, shapes, symbols)
  layout structures
  logical structures
  numeric data
  writer / speaker identities, ...
from different captured sources
  images (scanned, camera-based, synthesized)
  video
  on-line handwriting
  sound
Importance of document structures
Document = Content + Structures
Structures convey abstract high level information
They are revealed by styles
Structural document analysis
<DOCUMENT id=… >
<BODY>
<TITLE>A Master / Slave
Monitor … Network </TITLE>
<AUTHORS>
<AUTH>D.Jacobson</AUTH>
…
<AUTH>M. Shafiq</AUTH>
</AUTHORS>
<ABSTRACT><P>M…
Document analysis = image analysis of static documents to extract content and structures
Document analysis is applicable to
  captured images (from scanner, camera)
  synthetic images of electronic documents, available in unstructured or poorly structured form
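As a rough illustration of what structural analysis is meant to produce, the following Python sketch wraps already recognized content in markup shaped like the excerpt on this slide. Only the element names come from the slide; the function name `build_logical_document` and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_logical_document(title, authors, doc_id):
    """Wrap recognized content in DOCUMENT/BODY/TITLE/AUTHORS/AUTH elements."""
    root = ET.Element("DOCUMENT", {"id": doc_id})
    body = ET.SubElement(root, "BODY")
    ET.SubElement(body, "TITLE").text = title
    authors_el = ET.SubElement(body, "AUTHORS")
    for name in authors:
        ET.SubElement(authors_el, "AUTH").text = name
    return root

# Hypothetical recognition results, standing in for real analysis output.
doc = build_logical_document("An example title", ["A. Author", "B. Author"], "d1")
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

The hard part of document analysis is obtaining the title and author strings and their labels from an image; the serialization itself is trivial, which is why formatting is the easy direction.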
Analysis of Electronic Documents
Most electronic documents are unstructured or poorly structured
Document understanding can be seen as a reverse-engineering task using a fixed-layout document format (such as PDF or XPS) as a pivot format
Visual Audio Processing Chain
Visual Audio aims at recovering sound from old records by image analysis
Usefulness of document analysis
Extracting information from captured documents is useful in different contexts
  to avoid cumbersome keyboarding
  to capture information remotely
  to study the document’s content
  to categorize, classify and index digitized documents
    for digital libraries
    for culture preservation
  to reuse document chunks
  to reedit and restyle an existing document
  to extract information for integrated applications
    office automation
    database management
    information systems
  to perform multimodal alignment
Typical applications of document analysis
Commercial products are available for
  Text reading (OCR products)
  Office automation (mail reading and dispatching)
  Form processing (for dedicated applications)
More specialized products
  Postal address reading
  Check reading and processing
Form processing
Performance of form processing depends
  on form complexity
  on form variability
Fields are located easily
  if their positions are fixed
  when using different colors
Content recognition is hard for several reasons
  degraded images
  approximate positioning of symbols
  variability of handwriting
Check reading
Check reading can be automated at >90%
  difficulties: textured background, variability of writing
  facilitating factors: fixed vocabulary, redundancy (legal & courtesy amount), availability of contextual information (client database)
[Figure: sample check with labeled fields: legal amount, payee name, MICR, date, courtesy amount, signature]
from <www.a2ia.com>
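The redundancy between the legal amount (in words) and the courtesy amount (in digits) can be exploited by keeping only amounts that both field recognizers propose. The sketch below assumes a hypothetical interface in which each recognizer returns candidate amounts (in cents) with a confidence; it is not the method of any particular product.

```python
def reconcile_amounts(courtesy_candidates, legal_candidates):
    """Pick the amount supported by both fields.

    Each argument maps a candidate amount (in cents) to a recognizer
    confidence in [0, 1]. Returns (amount, joint_score), or None when the
    two fields share no candidate and the check should be rejected.
    """
    shared = set(courtesy_candidates) & set(legal_candidates)
    if not shared:
        return None
    # Joint score: product of the two confidences (an assumed combination rule).
    best = max(shared, key=lambda a: courtesy_candidates[a] * legal_candidates[a])
    return best, courtesy_candidates[best] * legal_candidates[best]

# Hypothetical recognizer outputs: the digit field confuses "1" and "7",
# while the word field clearly reads the smaller amount.
courtesy = {17500: 0.55, 11500: 0.40}
legal = {11500: 0.90, 11900: 0.05}
result = reconcile_amounts(courtesy, legal)
print(result)
```

Here the top courtesy candidate is overruled because the legal amount does not support it, which is the kind of cross-check that makes high automation rates plausible.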
Table of contents recognition
Aim
  to extract information from TOC to index journals
  to associate titles and authors with page numbers
Advantages
  Very precise goal
  Regular layout for a given journal
Difficulties
  Complex layout
  Great variability when considering journals universally
Analysis of historical documents
Aim
  to extract information to index historical documents
Challenges
  degradations
  irregular layout
  rich typography, ornaments
  old scripts (no OCR)
Possible approach
  word spotting
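Word spotting retrieves occurrences of a query word by comparing word images directly, without recognizing characters. One common formulation, used here as an assumption since the slide names no specific technique, compares column-wise ink profiles with dynamic time warping (DTW) so that stretched or compressed writing still matches. All names below are illustrative.

```python
def column_profile(word_image):
    """Number of ink pixels in each column of a binary word image."""
    return [sum(col) for col in zip(*word_image)]

def dtw_distance(a, b):
    """Dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def spot(query, candidates):
    """Rank candidate word images by DTW distance to the query, best first."""
    q = column_profile(query)
    scored = [(dtw_distance(q, column_profile(img)), name)
              for name, img in candidates.items()]
    return sorted(scored)

# Toy binary images (1 = ink): "stretched" repeats the first column of the query.
query = [[1, 0, 1],
         [1, 1, 1]]
stretched = [[1, 1, 0, 1],
             [1, 1, 1, 1]]
other = [[0, 0, 0],
         [1, 0, 1]]
ranking = spot(query, {"stretched": stretched, "other": other})
print(ranking)
```

Real systems use richer per-column features than a single ink count, but the alignment step has the same shape.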
Logical & physical document structures
Logical document structures
  Reflect the author’s point of view
  Independent of presentation
  Composed of application-dependent logical entities
    Chapters, sections
  Specific to the application and document class
Physical document structures
  Reflect the editor’s point of view
  Composed of a hierarchy of physical entities
    Text blocks, text lines and tokens
    Graphical primitives
  Universal and independent of the document class
Document processing cycle
[Figure: document processing cycle linking logical document, physical document, paper document and document image through formatting, rendering, printing, digitizing, and analysis and recognition]
Document analysis can be considered as the reverse of formatting
Relation between logical and physical structure
[Figure: diagram relating logical structure, styles and physical structure through formatting and analysis, with edit, print and display operations]
Document formatting is straightforward ...
But document analysis is a nontrivial task that generally cannot be fully automated
Processing chain
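[Figure: processing chain from image to structured document; stages shown include preprocessing, segmentation, layout analysis, OCR, OFR, post-analysis and document understanding; intermediate results include blocks, fonts and simple text]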
Pre-processing
Pre-processing aims at preparing the document image for further analysis; it includes
  Brightness / contrast enhancement
  Noise removal
  Skew / aberration correction
  Binarization / color clustering
  Shape smoothing
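To make the binarization step concrete, here is a minimal pure-Python sketch of global thresholding with Otsu's method. Otsu's method is a choice made for this example, not something the slide prescribes; it picks the gray level that best separates dark ink from a light background. The function names and the toy image are illustrative.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes between-class variance.

    `pixels` is an iterable of integers in 0..255. Pixels with value <= t
    form the dark class, the others the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray levels in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(image, threshold):
    """Map a 2-D list of gray levels to 1 (ink, dark) and 0 (background)."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]

# Toy gray-level image: a few dark ink pixels on a light background.
image = [
    [250, 250, 250, 250],
    [250,  20,  30, 250],
    [250,  25, 250, 250],
]
t = otsu_threshold(p for row in image for p in row)
binary = binarize(image, t)
print(t, binary)
```

A single global threshold is usually too crude for textured backgrounds or uneven lighting, where locally adaptive thresholds are the more common choice.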
Segmentation
Document segmentation aims at splitting the image into regions of interest; it includes
  Page segmentation into blocks
  Text, graphics and images separation
  Hairlines and frames detection
  Text block segmentation into text lines, words and characters
  In form processing, field separation
  Graphics segmentation into vectors and symbols
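Splitting a text block into text lines can be sketched with a horizontal projection profile: count the ink pixels in each row and cut wherever the count drops below a threshold. This is one classical technique, chosen here for illustration; it assumes a binarized, deskewed block, and the function name and `min_ink` parameter are hypothetical.

```python
def segment_text_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, of text lines.

    `binary` is a 2-D list with 1 for ink and 0 for background.
    A row belongs to a line when it holds at least `min_ink` ink pixels.
    """
    profile = [sum(row) for row in binary]   # horizontal projection profile
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                         # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))          # the line ended at row y - 1
            start = None
    if start is not None:                     # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines

# Toy block: two text lines separated by two blank rows.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
]
lines = segment_text_lines(page)
print(lines)
```

Applying the same idea to column sums inside each line gives a first cut at word and character segmentation, although touching characters and skew quickly break this simple scheme.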
Optical Character Recognition (OCR)
OCR aims at extracting character codes (ASCII) from text images
OCR was one of the earliest computer vision applications
  Early patents were filed in the 1910s, 30 years before the computer age!
OCR deals with many situations
  Isolated characters vs. complete words or phrases
  Different character classes (digits, uppercase letters, full text, …)
  Restricted or open vocabulary
  Machine-printed vs. handwritten text
  Different languages (with various diacritics) and different scripts (Latin, Greek, Hebrew, Arabic, Farsi, various Asian scripts, …)
  Imperfect image quality (low resolution, textured background, distortions, noise, …)
Text recognition related problems
Text analysis must also consider other aspects
In case of printed text
  Font recognition (family, size and style)
  Font categorization (with/without serifs, fixed vs. proportional font)
In case of handwritten text
  Writer identification or verification
  Writer classification
Layout analysis
Layout analysis aims at extracting physical structures of documents; it consists of
  locating, delimiting and identifying
    text blocks
    graphics
    tables
    formulas
    handwritten text fields
    annotations
  associating figures and captions
  locating and delimiting headers and footers
  recovering the reading order (of multicolumn documents)
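Recovering the reading order of a multicolumn page can be sketched as a simple heuristic: group blocks whose horizontal extents overlap into columns, read columns from left to right, and read each column from top to bottom. This assumes a clean Manhattan layout with no block spanning several columns; the `Block` type and the coordinates are hypothetical.

```python
from typing import NamedTuple

class Block(NamedTuple):
    name: str
    x: int      # left edge
    y: int      # top edge
    w: int      # width
    h: int      # height

def reading_order(blocks):
    """Order blocks column by column, top to bottom inside each column."""
    columns = []                       # each column is a list of blocks
    for b in sorted(blocks, key=lambda b: b.x):
        for col in columns:
            left = min(c.x for c in col)
            right = max(c.x + c.w for c in col)
            if b.x < right and b.x + b.w > left:   # horizontal overlap
                col.append(b)
                break
        else:
            columns.append([b])        # no overlap: b opens a new column
    ordered = []
    for col in columns:                # columns were created left to right
        ordered.extend(sorted(col, key=lambda b: b.y))
    return ordered

# Two-column page, blocks given in arbitrary order.
blocks = [
    Block("right-top", 320, 40, 260, 200),
    Block("left-bottom", 40, 300, 260, 300),
    Block("right-bottom", 320, 260, 260, 340),
    Block("left-top", 40, 40, 260, 240),
]
order = [b.name for b in reading_order(blocks)]
print(order)
```

Titles or figures that span several columns violate the assumption and need an extra rule, which is one reason reading-order recovery is listed as a task of its own.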
Optical Font Recognition (OFR)
OFR aims at identifying the fonts used
OFR is useful
  for improving OCR accuracy, by using dedicated classifiers to distinguish “O” and “0”, “I” and “1”, …
  for assigning logical labels, for logical structure recognition
Two strategies may be applied for OFR
  A priori OFR (without considering the content)
  A posteriori OFR (when the content is supposed to be known)
Document structure recognition
Document structure recognition (also referred to as document understanding) is the first step towards document interpretation
Document understanding deals with
  Logical labeling
  Logical structure recognition
Two levels of granularity are considered
  macro-structure analysis: labeling paragraphs / blocks
  micro-structure analysis: labeling words / strings
Document structure recognition is still considered an open issue
  There is no universal approach
  Solutions exist for dedicated document classes (museum notices, checks, tables of contents, scientific papers, newspapers, …)
Two Levels of Structural Document Analysis
Physical structure analysis (also layout analysis)
  to locate and identify text blocks, graphics, tables, formulas, handwritten text fields, annotations, …
  to recover the reading order
Logical structure analysis (also document understanding)
  to assign a hierarchy of logical labels
  first step towards interpretation
Use Case: Intelligent Newspaper Indexing
Full text indexing is not adequate for complex documents
The following items have to be identified
  headlines
  editorial
  articles (with title, author & function, summary, content, links, ...)
  captions (associated with images)
  readers’ letters
  advertisements
  ...
Use case: Understanding Museum Notices
[Figure: museum notices annotated with logical labels, from A. Belaïd, LORIA-CNRS Nancy]
Labels shown on each notice
  Group Vedette
  Area Title: Principal Title, End of the title
  Area Address / Date: Address, Date
  Area Collection
  Group Cote
Possibilities and limits of DA
Layout analysis is considered almost solved for printed documents
  It can be achieved generically
  Problems remain for textured backgrounds and degraded documents (historical & handwritten documents)
Document understanding is much less mature
  Solutions are application dependent
  Application of specific knowledge is needed (document models)
Need for Document Recognition Models
There is no universal approach!
Document recognition systems must be tuned
  for specific applications
  for specific document classes
Contextual information is required
Models provide information like
  generic document structures (DTD or XML Schema)
  geometrical and typographical attributes (style information)
  semantic information (keywords, dictionaries, databases, ...)
  statistical information
Content of document models
Generic structure
  Document Type Definition (DTD) or XML Schema
Style information
  Absolute or relative positioning
  Typographical attributes & formatting rules
Semantics (if available)
  Linguistic information, keywords
  Application-specific ontology
Probabilistic information
  Frequencies of items or sequences, co-occurrences
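The style part of such a model can be pictured as a small set of rules that map typographical and positional attributes to logical labels. The sketch below is a deliberately tiny, hand-written (explicit) model for one imaginary document class; the `TextBlock` fields, the thresholds and the label names are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str
    font_size: float   # in points
    bold: bool
    y: float           # top position as a fraction of page height (0 = top)

# Ordered (label, rule) pairs; the first rule that matches wins.
ARTICLE_MODEL = [
    ("TITLE",   lambda b: b.font_size >= 16 and b.y < 0.3),
    ("AUTHORS", lambda b: b.font_size >= 11 and not b.bold and b.y < 0.4),
    ("HEADING", lambda b: b.bold),
]

def label_blocks(blocks, model, default="P"):
    """Assign one logical label per block using the model's style rules."""
    labels = []
    for b in blocks:
        label = next((name for name, rule in model if rule(b)), default)
        labels.append(label)
    return labels

# Hypothetical output of layout analysis and font recognition for one page.
blocks = [
    TextBlock("Some title", 18, True, 0.10),
    TextBlock("First Author, Second Author", 12, False, 0.20),
    TextBlock("1 Introduction", 11, True, 0.45),
    TextBlock("Body text of the first paragraph", 10, False, 0.50),
]
labels = label_blocks(blocks, ARTICLE_MODEL)
print(labels)
```

The rules only hold for the document class they were written for, which illustrates why models must be tuned per application and why producing them by hand is costly.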
Trouble with document models
Document models are hard to produce and to maintain
  implicit models (hard coded in the application) => hard to modify, adapt, extend
  explicit models, written in a formal language => cumbersome to produce, need high expertise
  abstract models, learned automatically => need a lot of training data (with ground truth!)
Need for more flexible tools
  assisted environments with friendly user interfaces
  recognition improving with use
  models learned incrementally
Pattern Based Document Understanding (2-CREM) [Robaday 03]
Configurations consist of
  Set of vertices
    Labeled (type)
    Attributed (pos, typo, ...)
  Edges between vertices
    Labeled (neighborhood relation)
    Attributed (geom, ...)
Model consists of
  Extraction rules
  For each class
    Attribute selector
    List of patterns
[Figure: 2-CREM overview; labels: document image, extraction, configuration, classification, model (rules, patt., selector), id]
Performance evaluation
Performance evaluation is an important issue
  to compare algorithms
  to estimate correction costs of real applications
Ground-truthed databases are required
  cost reduction by document analysis tools (bootstrap)
  synthetic data as alternative
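A simple way to compare text recognizers against ground truth, and to estimate correction effort, is the character error rate: the edit distance between the recognized text and the ground truth, divided by the ground-truth length. The sketch below uses the standard Levenshtein distance; the choice of metric and the sample strings are illustrative, not taken from the course.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # delete r
                            curr[j - 1] + 1,          # insert h
                            prev[j - 1] + (r != h)))  # substitute or match
        prev = curr
    return prev[-1]

def character_error_rate(ground_truth, ocr_output):
    """Edit distance normalized by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)

# Typical OCR confusions: "m" read as "rn", "i" read as "l".
cer = character_error_rate("Document analysis", "Docurnent analysls")
print(round(cer, 3))
```

Each unit of edit distance corresponds to one keystroke-level correction, so the same number gives a rough estimate of manual correction cost.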
List of Lessons
1. Introduction to document analysis and recognition
2. Document image processing
3. Fundamentals of pattern recognition I
4. Fundamentals of pattern recognition II
5. Printed text recognition
6. Font recognition
7. Layout analysis and segmentation
8. Logical structure analysis
9. Graphics recognition
10. Handwriting recognition
11. Reverse engineering of documents
12. Multimodal applications
Conclusion on document analysis
Document analysis is useful for many applications
  Commercial systems solve some of them
Advanced document analysis prototypes are developed in many research labs around the world
No universal document analysis system is in sight
User assisted approaches may be a good trade-off for midsize applications
Structural document analysis will not disappear with exclusive electronic document handling (paperless office)
Organization of the course
Professor: Rolf Ingold, <[email protected]>, Pérolles-2, B421, 026 300 84 66
Assistant: Jean-Luc Bloechle, <[email protected]>, Pérolles-2, B440, 026 300 92 94
Course: Tuesday, 09:15-10:00 & 10:15-11:00
Exercise: Wednesday, 11:15-12:00
  requirements: 2/3 of series returned, 1/2 considered satisfactory
Homework: estimated at 4-6 hours a week
Website: http://diuf.unifr.ch/diva/web/
Examination
  oral, 20 minutes (alternatively written, 120 min)
  after spring semester (June 2008) or summer (August-September 2008)
Credits: 5 ECTS