![Page 1: Presentation of Claus Gravenhorst, BnF Information Day](https://reader036.vdocuments.net/reader036/viewer/2022081401/559f24bb1a28ab48578b4683/html5/thumbnails/1.jpg)
Optical Layout Recognition (OLR)
From unstructured to structured newspaper data
Claus Gravenhorst, CCS Content Conversion Specialists GmbH
ENP information day, Paris, November 27, 2014
Agenda
• About CCS
• General OLR-workflow for mass digitization
• Layout and structure analysis
• ENP OLR workflow
• Quality assurance
• Output – METS/ALTO package
• Use of structural data – Access and presentation
About CCS
• CCS Content Conversion Specialists GmbH (Hamburg), as technical project partner, provides its expertise and docWorks technology to set up and operate a mass digitization workflow that creates high-quality structured content from 2 million scanned newspaper pages provided by 5 library partners
• Page volume:
BnF = 1,000 k, NLE = 500 k, SUB HH = 480 k, NLF = 90 k, SBB = 10 k
• The distributed OLR workflow enables the contribution of project partners (content providers) to the integrated quality assurance process
• CCS is also contributing to the specification of the ENMAP metadata model
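As a quick check of the page volumes above, the following minimal sketch sums the per-partner figures and derives each partner's share. The variable names are illustrative only; the numbers are taken from the slide.

```python
# Page volumes per library partner, in thousands of pages (figures from the slide).
page_volume_k = {"BnF": 1000, "NLE": 500, "SUB HH": 480, "NLF": 90, "SBB": 10}

# Total volume and each partner's share of it.
total_k = sum(page_volume_k.values())
shares = {name: volume / total_k for name, volume in page_volume_k.items()}

print(f"total: {total_k} k pages")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total comes to 2,080 k pages, which matches the "2 million" quoted above when rounded.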
General workflow for mass digitization
[Workflow diagram; labels recovered from the slide]
• Scanner: robot, book, document, microfilm
• Stages: Scanning, Imaging, Conversion, Delivery QA (random), Final Output
• Processing: Layout Analysis, OCR, ISR
• QA + Correction; Reject Condition; Re-Scan
• Automated QA
• Manual QA: in-house, near-shore, off-shore, multiple locations
• Manual QA: in-house, near-shore
• Item Tracking: Document UID, Barcode, Check in, Check out
• Z39.50 Metadata; Image / Metadata Database; Repository
Layout and structure analysis
• Layout analysis is based on a "bottom-up" approach
• A general rule system recognizes words, text lines, text blocks and columns, and classifies text blocks, illustrations, advertisements, tables and the following page types:
  - title page (the title page of an issue)
  - content page (a page that consists of content/text only)
  - illustration page (a page that has at least one illustration)
  - advertisement page (a page that contains adverts only)
• Structure analysis through classification of headlines and grouping of zones into articles (incl. article continuation)
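The bottom-up idea and the page type rules above can be illustrated with a minimal sketch. This is not the docWorks rule system: the `Word` class, both function names, the pixel threshold and the precedence of the page type rules are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


def group_words_into_lines(words, max_top_gap=5):
    """Bottom-up step: merge words whose top edges are close into one text line.

    max_top_gap is a hypothetical tolerance in pixels.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][0].y - word.y) <= max_top_gap:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Restore left-to-right reading order inside each line.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def classify_page(block_types, is_issue_title_page=False):
    """Apply the four page type definitions from the slide.

    The order in which the rules are tested is an assumption, and so is the
    fallback to "content page" for mixes the slide does not cover.
    """
    if is_issue_title_page:
        return "title page"
    if block_types and all(t == "advertisement" for t in block_types):
        return "advertisement page"
    if "illustration" in block_types:
        return "illustration page"
    return "content page"


words = [
    Word("world", 60, 12, 50, 20),
    Word("Hello", 0, 10, 50, 20),
    Word("Next", 0, 40, 40, 20),
]
for line in group_words_into_lines(words):
    print(" ".join(w.text for w in line))
print(classify_page(["text", "illustration"]))
```

The same grouping step would be repeated one level up (lines into blocks, blocks into columns) before any classification is attempted.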
ENP OLR workflow | Conversion without scanning
[Workflow diagram; labels recovered from the slide]
• Locations: Material location, Conversion facility
• Delivery: Digital Image, Metadata
• Return: Digital Object
• Steps: MD Recording, Inspection / Automatic QA, Conversion, Doc Delivery
• Reject
Possible conversion scenarios
A) Conversion at library (on-site)
B) Conversion off-shore at CCS data center, final QA at the library via internet transfer (remote QA solution)
C) Conversion off-shore at CCS, final QA at the library by backup shipment
Scenario B | Remote QA at library
[Architecture diagram; labels recovered from the slide]
• Offshore Processing @ CCS: Storage (IN, OUT, POOL), dW Share, Master
• QA on-site @ Library: Storage (POOL), dW Share, RQA
• Connection: Internet
• INPUT; OUTPUT: METS/ALTO
Quality assurance
• @ CCS | Automated markup and basic manual correction:
  - Headlines, illustrations, tables, captions, advertisements, etc.
  - Article segmentation and grouping of zones into articles (incl. continuation)
• @ Content Provider (Library)
Recommended:
  - Zoning: correct classification of blocks as "text" or "illustration"
  - Article segmentation: correct identification of headlines/text blocks/captions
  - Grouping: correct grouping of blocks (text, illustration) to articles
  - Metadata: correct title, issue date and issue number
Optional:
  - Page types: correct page types
  - Page numbers: correct page sequence
  - OCR: perform text correction of specific zones (e.g. headlines, captions)
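Some of the recommended checks lend themselves to an automated pre-check before a human reviews the issue. The sketch below is a hypothetical illustration, not part of docWorks: the dictionary layout, the function name `check_issue`, and the two rules "every article has a headline" and "a block belongs to one article only" are assumptions.

```python
from datetime import date


def check_issue(issue):
    """Return a list of human-readable problems found in one issue (hypothetical checks)."""
    problems = []

    # Metadata: title, issue date and issue number must be present.
    if not issue.get("title"):
        problems.append("missing title")
    if not isinstance(issue.get("issue_date"), date):
        problems.append("missing or invalid issue date")
    if not issue.get("issue_number"):
        problems.append("missing issue number")

    # Segmentation and grouping: assumed rules, see the lead-in.
    owner_of_block = {}
    for article in issue.get("articles", []):
        block_types = [block["type"] for block in article["blocks"]]
        if "headline" not in block_types:
            problems.append(f"article {article['id']} has no headline")
        for block in article["blocks"]:
            previous = owner_of_block.setdefault(block["id"], article["id"])
            if previous != article["id"]:
                problems.append(
                    f"block {block['id']} is grouped into articles "
                    f"{previous} and {article['id']}"
                )
    return problems


# Made-up sample data for illustration.
sample_issue = {
    "title": "Example Gazette",
    "issue_date": date(1900, 1, 1),
    "issue_number": "1",
    "articles": [
        {"id": "A1", "blocks": [{"id": "B1", "type": "headline"},
                                {"id": "B2", "type": "text"}]},
        {"id": "A2", "blocks": [{"id": "B2", "type": "text"}]},
    ],
}
for problem in check_issue(sample_issue):
    print(problem)
```

A pre-check of this kind only flags candidates; whether a headline or grouping is actually correct still needs the manual review described above.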
Output | METS/ALTO package
• METS/ALTO metadata schemas describe the structured digital output object
• A newspaper issue processed in docWorks is converted into one METS XML file. It reflects the whole physical and logical structure, manages all links to the image files and the related ALTO XML files. ALTO is based on a standardized page description schema and contains all information of a page (print space, margins, coordinates, OCR results).
• Benefits of structural markup:
  - better browsing and more precise text search
  - better access and display on tablet and mobile devices
  - automated article classification and clustering through data/text mining and linguistic technologies
  - user engagement for manual online text correction, article classification, annotation, building personal collections, etc.
  - sharing articles via social media platforms like Facebook, Twitter, etc.
_______________
METS = Metadata Encoding and Transmission Standard
ALTO = Analyzed Layout and Text Object
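To make the ALTO side of the package concrete, here is a minimal sketch that reads word text and coordinates from an ALTO page using Python's standard library. The sample fragment is made up and abbreviated (it is not schema-complete), the ALTO v2 namespace is an assumption since real files may use another schema version, and `extract_words` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO v2 namespace; other ALTO versions use a different URI.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Made-up, abbreviated ALTO fragment for illustration.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="2800">
        <TextBlock ID="TB1" HPOS="100" VPOS="100" WIDTH="600" HEIGHT="40">
          <TextLine HPOS="100" VPOS="100" WIDTH="600" HEIGHT="40">
            <String CONTENT="Hamburg" HPOS="100" VPOS="100" WIDTH="250" HEIGHT="40"/>
            <SP HPOS="350" VPOS="100" WIDTH="20"/>
            <String CONTENT="harbour" HPOS="370" VPOS="100" WIDTH="230" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def extract_words(alto_xml):
    """Collect every String element with its text block ID and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for block in root.iterfind(".//alto:TextBlock", ALTO_NS):
        for string in block.iterfind(".//alto:String", ALTO_NS):
            words.append({
                "block": block.get("ID"),
                "text": string.get("CONTENT"),
                "hpos": float(string.get("HPOS")),
                "vpos": float(string.get("VPOS")),
                "width": float(string.get("WIDTH")),
                "height": float(string.get("HEIGHT")),
            })
    return words


for word in extract_words(SAMPLE_ALTO):
    print(word["block"], word["text"], word["hpos"], word["vpos"])
```

The block IDs are what a METS file would point to when it groups blocks into articles, so the coordinates gathered here can be tied back to the logical structure.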
Access and Presentation (I)
• Sample presentation system (Veridian)
• Browse by date, title
• Text search
• Article hit list
• Word highlighting
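Text search with an article hit list and word highlighting depends on knowing, for each word, which article it belongs to and where it sits on the page. The following is a minimal sketch of that idea and does not describe how Veridian is implemented; the data layout and the names `build_index` and `search` are assumptions.

```python
from collections import defaultdict


def build_index(articles):
    """Map each lower-cased word to the articles and page boxes where it occurs."""
    index = defaultdict(list)
    for article in articles:
        for word in article["words"]:
            index[word["text"].lower()].append((article["id"], word["box"]))
    return index


def search(index, articles, term):
    """Return one hit per article, with the boxes to highlight on the page image."""
    boxes_by_article = defaultdict(list)
    for article_id, box in index.get(term.lower(), []):
        boxes_by_article[article_id].append(box)
    headlines = {article["id"]: article["headline"] for article in articles}
    return [
        {"article": article_id, "headline": headlines[article_id], "highlight": boxes}
        for article_id, boxes in boxes_by_article.items()
    ]


# Made-up sample data; boxes are (x, y, width, height) in pixels.
articles = [
    {"id": "A1", "headline": "Harbour news",
     "words": [{"text": "Harbour", "box": (100, 100, 250, 40)},
               {"text": "news", "box": (370, 100, 120, 40)}]},
    {"id": "A2", "headline": "Weather",
     "words": [{"text": "harbour", "box": (900, 500, 230, 30)}]},
]
index = build_index(articles)
for hit in search(index, articles, "harbour"):
    print(hit["article"], hit["headline"], hit["highlight"])
```

Because hits are grouped per article rather than per page, the result list can show headlines, which is only possible once zones have been grouped into articles.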
Access and Presentation (II)
• Issue
• Table of contents
Access and Presentation (III)
• Text & image view
• User text correction
• Article clipping
• Print article
• Distribute via email and social media platforms
Thank you for your attention!
www.europeana-newspapers.eu