OCR at BnF : history, production, projects, research
Geneviève Cron, Bibliothèque Nationale de France
BnF/OCR/Production/History
• 1992 to today:
  – digitization tenders + internal programs
• 2005 to today:
  – digitization + OCR-isation tenders
  – OCR-isation tenders
• 2010 to today:
  – OCR + Epub
  – digitization + OCR + Epub
• Image quality
• OCR output
Access
Search
Edit
BnF/OCR/Production/History
[Chart: Digitized, OCR-ised, and Epub series by year; horizontal axis 1980 to 2020, vertical axis 0 to 2 500 000]
BnF/OCR/Production/Workflow
[Workflow diagram: Selection, Digitization, OCR-ization, Control, Eval]
Selection
BnF/OCR/Research/Master2010
[Diagram: OCR Rate Prediction, test "> X %", branches YES / NO, OCR. Other labels: XX, Litis +]
BnF/OCR/Research/PhD2010
[Diagram: OCR Rate Prediction, test "> x % ?", branches YES / NO, OCR. Other label: X]
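The "> x % ?" test in the diagram above can be sketched as a simple threshold filter. This is a minimal illustration only. It assumes that a predicted OCR rate is already available for each document and that the YES branch leads to OCR, which the slide does not state explicitly. The `Document` class, the `select_for_ocr` function, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Document:
    identifier: str
    predicted_ocr_rate: float  # hypothetical predictor output, in [0, 1]


def select_for_ocr(documents, threshold):
    """Split documents with the '> x %' test: (sent to OCR, held back)."""
    accepted, rejected = [], []
    for doc in documents:
        if doc.predicted_ocr_rate > threshold:
            accepted.append(doc)   # YES branch (assumed: goes to OCR)
        else:
            rejected.append(doc)   # NO branch
    return accepted, rejected


# Made-up documents and threshold, for illustration only.
docs = [Document("doc-1", 0.95), Document("doc-2", 0.40), Document("doc-3", 0.80)]
accepted, rejected = select_for_ocr(docs, threshold=0.80)
print([d.identifier for d in accepted])
print([d.identifier for d in rejected])
```

Because the test is strict ("greater than"), a document predicted exactly at the threshold falls on the NO side in this sketch.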
Digitization
BnF/OCR/Research/Digidoc
• Digidoc:
  – Optimizing digitization according to its intended use:
    • Conservation
    • OCR
    • Communication
  – New metadata format
    • with information gathered in the digitization process
    • including data from image processing
OCRization
OCR
BnF/OCR/Production/OCR Workflow
[Diagram labels: Image, Raw OCR, OCR HQ, OCR, Manual correction, Service provider; test "> X %"; OCR rate intervals [0%, 100%], [X%, 100%], [0%, 60%], [60%, 100%]]
Is X measured on words or on characters?
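The choice between a word-level and a character-level threshold matters because the same OCR output gives very different rates on the two scales. The sketch below illustrates this with a standard Levenshtein edit distance. It is not the evaluation method used at BnF, and the sample strings are made up.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def accuracy(reference, hypothesis):
    """1 - (edit distance / reference length), clamped to [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


# Made-up ground truth and OCR output with two character errors.
truth = "la bibliotheque nationale de france"
ocr = "la bibliotheqne nationale de frauce"

char_rate = accuracy(truth, ocr)                  # on characters
word_rate = accuracy(truth.split(), ocr.split())  # on words
print(f"{char_rate:.3f} {word_rate:.3f}")  # prints "0.943 0.600"
```

Two wrong characters out of 35 leave the character rate above 94%, but they spoil two of five words, so the word rate drops to 60%. A threshold X therefore only has meaning once the unit is fixed.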
BnF/OCR/Production/OCR_Inhousing
• For digitized images
  – Internally produced
  – Produced in the early projects where no OCR was made (< 2004)
• Segmentation and segmentation correction
• OCR and OCR correction / control
• Goal: 3 000 000 pages / year (2012)
• Current status: tests, settings, tuning
BnF/OCR/Research/Quaero
– BnF
  • provides a corpus
  • for training and evaluation
– Research and development topics
  • Automatic and robust segmentation
  • Grayscale OCR
  • Reading order
  • Recognition on maps
  • Manuscript OCR
  • Named Entity Recognition
BnF/OCR/Research/Impact/BnF’s Goals
• Share experience on past and ongoing OCR workflows across Europe
• Take part in OCR improvement
  – Improve the quality of the OCR output
  – Reduce post-correction
  – Enlarge the range of material that can be submitted to OCR with good quality expectations
BnF/OCR/Impact/BnF’s_Involvement
• Requirements
• Evaluation
  – Corpus definition, ground truth production and control
  – Evaluation criteria
• Case studies
  – BW / Greyscale
  – Destructive / Non-destructive
  – OCR / Double keying
• Demonstrations, tests, servers for tool hosting
• Center of competence
BnF/OCR/Research/Impact/Corpus
Surname    Name  #     Bpp            Description
Lang       A     5893  Greyscale      Language dataset; 17th century; about Descartes
19th       B     5139  Greyscale      99.9% in Alto format + Alto2Page conversion + manual correction
Newspaper  C     200   Greyscale      Newspaper
Theater    D     34    Black & White  Theater
Bad        E     505   Black & White  Bad quality image
Easy       F     470   Black & White  Easy book
Struct     G     10    Black & White  Complex structure
Dev / Eval / Demo
BnF/OCR/Research/Impact/CorpusA=Lang
BnF/OCR/Research/Impact/CorpusB=19th
BnF/OCR/Research/Impact/CorpusC=Newspaper
BnF/OCR/Research/Impact/CorpusE=Bad
BnF/OCR/Research/Impact/CorpusF=Easy
BnF/OCR/Research/Impact/CorpusG=Struct
BnF/OCR/Research/Impact/RelatedTools
Lang 19th Newspaper Theater Bad Easy Struct
OCR (Abbyy + Tesseract): x x x x x x x
Collaborative Correction: x x x x x x x
Structure analysis and correction: x x x x
OCR post-correction: x
Binarisation & Colour Reduction: x x x
Segmentations: x x x x
Lexicon (French) / NE Repository: x
Post Correction
OCR
BnF/OCR/Projects/Wikimedia partnership
http://fr.wikisource.org/wiki/Page:Anomyme_-_Raoul_de_Cambrai.djvu/82
BnF/OCR/Production/Epub
• OCR is useful for image-to-ebook conversion
• About 200 ebooks from the early tender
• From images digitized in the 1990s
• Paper material should be perfect
• Image as well
• About 7 000 per year (2012)
Control and evaluation
Control Eval
BnF/OCR/Research/OCR Evaluation
– OCR evaluation tool
  • PhD 2007
– OCR rates
  • Computation
  • Estimation
  • Control
  • In a mass digitization context (no ground truth)
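Without ground truth, an OCR rate can only be estimated from indirect signals. One possible proxy, shown here purely as an illustration and not as the method developed at BnF, is the word confidence that OCR engines can write into ALTO files (the `WC` attribute of `String` elements). The sketch averages it, weighted by word length. The function names and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def mean_word_confidence(alto_xml):
    """Length-weighted mean of String/@WC, or None if no confidence is present."""
    root = ET.fromstring(alto_xml)
    weighted_sum = 0.0
    total_chars = 0
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        wc = element.get("WC")
        content = element.get("CONTENT", "")
        if wc is None or not content:
            continue  # skip words without a confidence value
        weighted_sum += float(wc) * len(content)
        total_chars += len(content)
    return weighted_sum / total_chars if total_chars else None


# Made-up ALTO fragment for illustration.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Raoul" WC="0.90"/>
    <SP/>
    <String CONTENT="de" WC="0.50"/>
    <SP/>
    <String CONTENT="Cambrai" WC="1.00"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

estimate = mean_word_confidence(SAMPLE)
print(round(estimate, 3))
```

Engine confidence is not the same thing as measured accuracy, so an estimate of this kind would still need to be calibrated against a sample with ground truth before being used for control.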
BnF/OCR/Production/OCR Rates
98.6%!
BnF/OCR/Projects/Under Construction
• Recognition on
  – Music scores (OMR)
  – Medieval manuscripts
  – Portolan charts and maps
  – Geolocation using OCR results
• Crowdsourcing on Gallica
• XML Alto to Epub conversion
• Host the Impact Center of Competence?
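As a rough idea of what the XML Alto to Epub conversion listed above involves, the sketch below turns the text blocks of an ALTO page into XHTML paragraphs, the kind of markup an Epub content document is built from. It is a simplified assumption of one possible first step, not the BnF converter: it ignores hyphenation (`HYP` elements), reading order, styles, and Epub packaging. The function names and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_xhtml_paragraphs(alto_xml):
    """Emit one <p> per ALTO TextBlock, joining the words of its TextLines."""
    root = ET.fromstring(alto_xml)
    paragraphs = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            words = [s.get("CONTENT", "") for s in line.iter()
                     if local_name(s.tag) == "String"]
            if words:
                lines.append(" ".join(words))
        if lines:
            # Escape the text so that characters like & and < stay valid XHTML.
            paragraphs.append("<p>" + escape(" ".join(lines)) + "</p>")
    return "\n".join(paragraphs)


# Made-up ALTO fragment with two text blocks.
SAMPLE_PAGE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock><TextLine>
      <String CONTENT="Raoul"/><SP/><String CONTENT="de"/><SP/><String CONTENT="Cambrai"/>
    </TextLine></TextBlock>
    <TextBlock><TextLine>
      <String CONTENT="Chanson"/><SP/><String CONTENT="de"/><SP/><String CONTENT="geste"/>
    </TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

print(alto_to_xhtml_paragraphs(SAMPLE_PAGE))
```

Mapping each `TextBlock` to one paragraph is a design shortcut. A production conversion would more likely rely on the structure analysis and correction steps mentioned earlier to decide where paragraphs, headings, and columns begin.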