workshop ocr/ner - ku leuven · 2020-04-01 · workshop ocr/ner digital humanities summer school...
![Page 1: Workshop OCR/NER - KU Leuven · 2020-04-01 · Workshop OCR/NER Digital Humanities Summer school 2015 . Agenda • Introduction OCR • Introduction NER • Use case Succeed project](https://reader034.vdocuments.net/reader034/viewer/2022042417/5f3281f838b6ce3eb103b72d/html5/thumbnails/1.jpg)
Workshop
OCR/NER Digital Humanities Summer school
2015
Agenda
• Introduction OCR
• Introduction NER
• Use case Succeed project @ KU Leuven
• Getting set up
• Hands-on session
Who are we?
• INL (Institute for Dutch Lexicology)
– Katrien Depuydt
– Jesse de Does
• LIBIS
– Roxanne Wyns
– Sam Alloing
What is OCR?
• Definition
– Converting an image of text to electronic text
• But entails a lot more!
– It is a workflow
• Recognition of printed text, not handwritten text
Why OCR?
• Improve discoverability
– Search within the image
– Search across images
• Text processing
– Named entity recognition
• See later
– Further analysis
• TEI…
• …
Why bad performance on some text?
• Quality of the printed text
– Can be a problem on historical material
• Different spacing between words, characters,…
• Low quality of scans
• Font and language not supported
• Complex layout is not preserved
Workflow OCR
[Diagram of workflow steps: Digitisation, Pre-processing, Attestation, Improving, Executing OCR, Post-processing, Evaluation set]
Digitisation
• Images in greyscale or black and white
– The OCR software will convert to B&W (= binarisation)
• 300 dpi is recommended
– For smaller fonts 400 to 600 dpi
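The binarisation step mentioned on this slide can be sketched in a few lines. This is only an illustration, assuming Otsu's global thresholding, which is one common method; the slides do not say which method a given OCR engine uses, and many engines use adaptive methods instead. The function names `otsu_threshold` and `binarise`, and the convention 0 = ink, 1 = paper, are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor); Otsu maximises it.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(image):
    """image: rows of grey values 0-255. Returns rows of 0 (ink) or 1 (paper)."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 1 for value in row] for row in image]


# Tiny hand-made greyscale "scan": dark values are ink, light values are paper.
scan = [[30, 40, 200],
        [220, 35, 210]]
print(binarise(scan))
```

A single global threshold tends to work on evenly lit scans; stained or unevenly lit historical pages are where it usually breaks down.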
Workflow OCR
[Diagram of workflow steps: Digitisation, Pre-processing, Attestation, Improving, Executing OCR, Post-processing, Evaluation set]
Pre-processing
• All kinds of improvements
• Depends on the capabilities of your OCR engine
– Some engines contain some of the pre-processing features
• Layout correction
– De-skew
• Document Deskewer
• Scan Tailor
• Page Curl Corrector
• Removal tools
– Noise removal
– Border removal
• Scan Tailor
• NCSR Border Detection and Removal
• Image correction/enhancements
– Binarisation
• Creation of black-and-white images
– Image tools
• ImageMagick
• Photoshop
• GIMP
• …
[Figures: de-skewing of an image; page curl]
Pre-processing: skew and warp correction
Pre-processing: Examples of noise
Pre-processing: “Star Wars” journal…
Workflow OCR
[Diagram of workflow steps: Digitisation, Pre-processing, Attestation, Improving, Executing OCR, Post-processing, Evaluation set]
Attestation
• Create a ground truth or gold standard (= 100% correct transcription)
• Compare the ‘truth’ to the OCR output
• Evaluate the OCR output; other uses include:
– To test different OCR engines
– To test outsourced OCR
• Tool
– Aletheia
• Creates PAGE XML ground truth
OCR workflow
[Diagram of workflow steps: Digitisation, Pre-processing, Attestation, Improving, Executing OCR, Post-processing, Evaluation set]
OCR evaluation
• OCRevalUAtion
• Page Evaluator for Tesseract
• Determine the error rate
– CER = character error rate
– WER = word error rate
• Compare OCR output against ground truth
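As a rough illustration of how CER and WER can be computed, the sketch below uses the common definition: the edit distance (insertions, deletions, substitutions) between OCR output and ground truth, divided by the length of the ground truth. This is an assumption for the example; the tools named on the slide may normalise or count errors differently. The function names and the sample strings are made up.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(ground_truth, ocr_output):
    """Character error rate: character edits per ground-truth character."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


def wer(ground_truth, ocr_output):
    """Word error rate: word edits per ground-truth word (whitespace tokens)."""
    reference_words = ground_truth.split()
    if not reference_words:
        raise ValueError("ground truth must not be empty")
    return edit_distance(reference_words, ocr_output.split()) / len(reference_words)


truth = "the quick brown fox"
ocr = "tbe quick brovvn fox"
print(f"CER = {cer(truth, ocr):.3f}, WER = {wer(truth, ocr):.3f}")
```

In this sample only 3 of 19 characters are wrong, yet half of the words are affected, which is why WER is usually much higher than CER on the same page.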
OCR workflow
[Diagram of workflow steps: Digitisation, Pre-processing, Attestation, Improving, Executing OCR, Post-processing, Evaluation set]
Improving
• Different techniques used to improve the OCR result
• Pattern training
– Teach the OCR engine new characters
– Tools
• Part of some OCR engines
• Franken+ for Tesseract
• Cutout and page generator for Tesseract
• Dictionaries
– Built-in
– Custom
• Tools to create dictionaries: CoBaLT
• Changing the settings of the OCR engine
• Add training data
OCR workflow
[Diagram of workflow steps: Digitisation, Pre-processing, Attestation, Improving, Executing OCR, Post-processing, Evaluation set]
OCR applications
• Desktop applications
– GUI
– Processing page by page
– Some batch processing capabilities
– Easy to use
• OCR engines
– No GUI
– Processing large amounts of images
– More OCR features and more fine-tuning
– More knowledge required to use
2 types of OCR
• Omnifont engine
• Adaptive engine
– No knowledge of example font needed
– Creates a model during training
• => More training required
• Examples
– ABBYY FineReader
– Tesseract
– OmniPage
– OCRopus/ocropy
– BIT-Alpha
– IBM Adaptive OCR engine
Actions of the OCR engine
• See pre-processing
– Binarisation
• Layout analysis
– Identify the regions with text
– Tools:
• Layout Evaluation Tool
• Segmentation
– Line, character and word
• Text recognition
– Character categorisation
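To make the segmentation step more concrete, here is a minimal sketch of line segmentation on an already binarised page. It assumes a classical projection-profile approach (a text line is a run of consecutive pixel rows that contain ink), which is not necessarily what any engine listed in these slides does internally. The function name `segment_lines`, the convention 0 = ink, and the tiny sample page are all assumptions for the example.

```python
def segment_lines(binary_image, ink=0):
    """Return (top, bottom) row indices for each text line; bottom is exclusive."""
    lines = []
    start = None  # row where the current line began, or None between lines
    for row_index, row in enumerate(binary_image):
        has_ink = any(pixel == ink for pixel in row)
        if has_ink and start is None:
            start = row_index                 # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, row_index))  # blank row closes the line
            start = None
    if start is not None:                     # page ends while inside a line
        lines.append((start, len(binary_image)))
    return lines


# 0 = ink, 1 = paper: two "lines" of text separated by blank rows.
page = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
]
print(segment_lines(page))
```

The same idea applied to columns within a line gives a crude word or character split. Skewed pages and noise break the blank-row assumption, which is one reason pre-processing matters.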
Output formats
• Vendor-specific XML
– Richest format
• Text
• ALTO
– Library of Congress
– XML format describing the page
– Used in some viewer software, to overlay text on image
• TEI
– Text Encoding Initiative
– Describe text in high detail
– XML format
– Don’t expect too much from the OCR engine
• …
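The sketch below shows why ALTO is useful for overlaying text on an image: each recognised word is a `String` element carrying its text (`CONTENT`) and its position (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`). The sample document is hand-written for this example, not output from a real engine, and the helper name `words_with_boxes` is made up. Real ALTO files contain more elements and may use a different schema version.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified ALTO fragment (illustrative only).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2480" HEIGHT="3508">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1">
            <String CONTENT="Digital" HPOS="100" VPOS="200" WIDTH="310" HEIGHT="60"/>
            <SP/>
            <String CONTENT="Humanities" HPOS="430" VPOS="200" WIDTH="480" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def words_with_boxes(alto_xml):
    """Return (text, x, y, width, height) for every String element."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Compare the local tag name only, so the namespace version does not matter.
        if element.tag.rsplit("}", 1)[-1] == "String":
            words.append((
                element.get("CONTENT"),
                float(element.get("HPOS")),
                float(element.get("VPOS")),
                float(element.get("WIDTH")),
                float(element.get("HEIGHT")),
            ))
    return words


for text, x, y, width, height in words_with_boxes(ALTO_SAMPLE):
    print(f"{text!r} at ({x}, {y}), size {width} x {height}")
```

A viewer can use these boxes to highlight search hits on the page image.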
OCR workflow
[Diagram of workflow steps: Digitisation, Pre-processing, Attestation, Improving, Executing OCR, Post-processing, Evaluation set]
Post-correction
• Manual or semi-automated
• Tools
– Korrektor
– Virtual Transcription Laboratory
– Page corrector for Tesseract
– CONCERT (IBM)
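A semi-automated correction pass could look roughly like the sketch below: tokens that are not in a word list are matched against it, and the closest entry is offered as a suggestion for a human to accept or reject. This is not how the tools listed above work internally; it is an assumption-laden illustration using Python's standard `difflib` module. The function name `suggest_corrections`, the 0.8 similarity cutoff, and the sample lexicon are invented for the example.

```python
import difflib


def suggest_corrections(tokens, lexicon, cutoff=0.8):
    """Map each unknown token to its closest lexicon entry, if one is close enough."""
    known = {word.lower() for word in lexicon}
    candidates = sorted(known)
    suggestions = {}
    for token in tokens:
        word = token.lower()
        if word in known:
            continue  # already a dictionary word, nothing to suggest
        matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
        if matches:
            suggestions[token] = matches[0]
    return suggestions


lexicon = ["leuven", "library", "digital", "humanities"]
ocr_tokens = ["digital", "hurnanities", "Leuven", "libraiy"]
print(suggest_corrections(ocr_tokens, lexicon))
```

Suggestions like these should stay suggestions: on historical material a spelling missing from the dictionary is often a genuine variant, not an OCR error.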
Conclusion
• Optimise for your use case
– Not all use cases need perfection
• Start with easy gains
– Dictionary
– Good images
• Evaluate!