Document Content Analysis for Digital Archives

Eric Saund
Perceptual Document Analysis Area, Intelligent Systems Laboratory, Palo Alto Research Center

Page 1: Document Content Analysis for Digital Archives

Document Content Analysis for Digital Archives

Eric Saund

Perceptual Document Analysis Area

Intelligent Systems Laboratory, Palo Alto Research Center

Page 2: Document Content Analysis for Digital Archives

Digital Archives

Tasks
- casual browsing
- look up information
- follow trails
- compose narratives
- form and organize collections
- distribute
- assemble timelines

Operations
- browse by topic, type, etc.
- search for known items
- search for items meeting criteria
- find duplicate items
- find similar items
- follow links
- establish links
- apply logical rules
- edit metadata

All enabled by Metadata

Content layer

Metadata layer

Index

Page 3: Document Content Analysis for Digital Archives

Metadata

Two major problems with metadata:

1. Extracting metadata from raw content items.

2. Metadata is always incomplete for some purposes.

Title: Sarix neob
Date: 37-23-55
Media: niobium
Format: jnb
Author: Rsi Liwer
Text: “aliirn xeca sarlia isyb...”
Index ID: 34962s

pointer to item

Metadata as a static record

computeSimilarityTo()
containsEntity?()
fitsSlotInModel?()
extractTextAfterImageCleanup()

Metadata as an interface

functions applied to item content

Automatic Content Analysis
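
The contrast between the two views of metadata on this slide can be sketched in Python. This is only an illustration under stated assumptions: the field names mirror the sample record, the method names are Python spellings of the slide's `computeSimilarityTo()` and `containsEntity?()`, and `TextItem`, `item_path`, the word-overlap similarity, and the substring entity test are hypothetical stand-ins for real content analysis.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class StaticRecord:
    """Metadata as a static record: fixed fields plus a pointer to the item."""
    title: str
    date: str
    media: str
    author: str
    index_id: str
    item_path: str  # hypothetical form of the "pointer to item"


class ContentInterface(Protocol):
    """Metadata as an interface: questions answered by functions over content."""

    def compute_similarity_to(self, other: ContentInterface) -> float: ...

    def contains_entity(self, entity: str) -> bool: ...


class TextItem:
    """Toy item whose content is plain text (an assumption for illustration)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def _tokens(self) -> set[str]:
        return set(self.text.lower().split())

    def compute_similarity_to(self, other: TextItem) -> float:
        # Jaccard overlap of word sets, a stand-in for a real similarity measure.
        a, b = self._tokens(), other._tokens()
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def contains_entity(self, entity: str) -> bool:
        # Naive case-insensitive substring match, not real entity recognition.
        return entity.lower() in self.text.lower()


# The static record can only answer questions its fields anticipated.
record = StaticRecord("Sarix neob", "37-23-55", "niobium", "Rsi Liwer",
                      "34962s", "items/34962s")

# The interface computes answers from the content on demand.
invoice = TextItem("sale invoice total due 63,767.00")
bill = TextItem("invoice total due this month")
print(invoice.contains_entity("invoice"))
print(invoice.compute_similarity_to(bill))
```

The record is fixed at cataloging time, while the interface can be extended with new functions as new purposes arise, which is one way to read the slide's point that metadata is always incomplete for some purposes.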

Page 4: Document Content Analysis for Digital Archives

State of the Art

• document image analysis

• photographic image analysis

• video/film analysis

• audio analysis

• web site analysis

text

appearance, layout

who, what, where, when

topics, entities

genre, category, functional roles

genre, scenes, who, what, ...

genre, speech/music, speaker ID, transcription

Page 5: Document Content Analysis for Digital Archives

APR 21 2004 17:38 FR ---- 203 749 4519 TO 4264 P.02/06 * 9STCapitalModularSpace SALE INVOICE _ jz5| g'" ni'idspace.com -IPage: 1 FAX TO:_ BILL TO: REMIT TO: ACCOUNT NO.: ;m11 GE Capital Corp10 Riverview Drive Danbury, CT 06810 PO NUMBER: per Chad LOCATION OF UNITS: SAME AS ABOVE UNIT NO.: 076613 SERIAL NO.: SM069A 26,351.00 4,;- UNIT NO.: 076614 SERIAL NO.: SM0G9B 26, 351. 00 DOWN PAYMENT 0. 00 BUILDING DELIVERY0. 00 BUILDING DELIVERY 400.00 BLOCK AND LEVEL 0. 00 BLOCK AND LEVEL 2,100.00ANCHOR/TIE DOWN 780 00 DECKING 950. 00 / ELECTRICAL 1, 350. 00 / PLUMBING3, 025. 00 INSTALLATION SITE MANAGEMENT 1,100 00 SKIRTING- VINYL 1,360. 00TOTAL DUE THIS INVOICE 63,767.00

When OCR Works...
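
Even when OCR works, its output usually needs cleanup before it can serve as metadata. The sketch below is a hypothetical illustration, not part of the talk: it pulls monetary amounts out of fragments of the OCR text above, tolerating the stray spaces the OCR inserted after commas and decimal points. The pattern and the function name `extract_amounts` are assumptions, and amounts whose decimal point was lost entirely (such as "780 00") are deliberately not handled.

```python
from __future__ import annotations

import re

# 1-3 digits, optional thousands groups, then a decimal point and two digits.
# OCR may insert a space after a comma or after the decimal point.
# The lookbehind avoids matching the tail of a longer digit run.
AMOUNT = re.compile(r"(?<!\d)\d{1,3}(?:,\s?\d{3})*\.\s?\d{2}")


def extract_amounts(ocr_text: str) -> list[float]:
    """Return the monetary amounts found in noisy OCR text."""
    amounts = []
    for match in AMOUNT.finditer(ocr_text):
        # Drop thousands separators and stray whitespace before parsing.
        cleaned = re.sub(r"[,\s]", "", match.group())
        amounts.append(float(cleaned))
    return amounts


# Fragments copied from the OCR output shown on this slide.
sample = (
    "SERIAL NO.: SM069A 26,351.00 "
    "DECKING 950. 00 / ELECTRICAL 1, 350. 00 "
    "TOTAL DUE THIS INVOICE 63,767.00"
)
amounts = extract_amounts(sample)
print(amounts)
```

A pattern like this recovers field values but says nothing about which line item each amount belongs to; that requires the layout and structural cues discussed on the next slide.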

Page 6: Document Content Analysis for Digital Archives

Header alignment

Graphical logo

Font / Layout / Symbol Pattern of Fax ID Line

Redacting markings

Address block

Repeated elements

Hand-drawn graphical annotation

Handwritten Textual Annotation

Textual Field Indicator

Tabular Layout

Graphic separator

ST

Amount Field

How People See a Document

Category / Type
• Invoice
• Bill
• Itemized purchase listing
• Annotated document

Structural Elements and Relations

Relational Context
• Construction project
• Supplier relationship
• Inventory & materials management

Page 7: Document Content Analysis for Digital Archives

Technology Ecology

Academia
• Computer Vision
• Document Recognition
• Information Retrieval
• Machine Learning
• Speech Recognition
• Natural Language
• Artificial Intelligence
Paying Customer:
• government
• industry
Characteristics:
• science-based
• toy problems
• fragile

Industry
• Document Imaging
• Transaction Processing
• Workflow Systems
• Database Vendors
• Business Software
• Business Process Outsourcing
• Advertising/Search
Paying Customer:
• businesses
• consumers
• government
Characteristics:
• engineering-based
• robust
• limited capabilities

Hobbyists
• museums
• schools
• local governments
• NGOs
• individuals
• startups
• boutique companies
• shoestring projects in Academia and Industry

Page 8: Document Content Analysis for Digital Archives

A Hobby Project

Wanted:

Document Capture Station + Collection Comprehension Engine

Page 9: Document Content Analysis for Digital Archives

Collection Comprehension Engine

OCR

308991

Document Structure Modeling

Document Collection Linking

Image Processing

Automatic Cataloging

Genre Tagging

Clustering

Classification

Visualization GUI

Page 10: Document Content Analysis for Digital Archives

Conclusion

The hobby stage brings together kindred spirits.