IMPACT Interoperability and Evaluation Framework. Clemens Neudecker
DESCRIPTION
Presented at the "IMPACT demonstration session at the BNE" in October, at the Biblioteca Nacional de España (BNE).
TRANSCRIPT
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT Interoperability and Evaluation Framework
Clemens Neudecker, National Library of the Netherlands
IMPACT Demo Day, Biblioteca Nacional de España
OCR: A multitude of challenges…
I. OCR challenges (gothic fonts, bleed-through, warping, etc.)
OCR: A multitude of challenges…
II. Language challenges (spelling variants, inflection, and many more!)
Example: historical variants of the Dutch word ‘wereld’ (world):
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
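One common way to cope with such variants is a lexicon that maps each historical spelling to its modern form. The following is a minimal sketch of that idea, not the IMPACT lexicon tooling itself: the dictionary `VARIANT_LEXICON` and the functions `normalize` and `expand_query` are hypothetical names, and only a handful of the variants listed above are included.

```python
# Hypothetical mini-lexicon: historical spelling variant -> modern form.
# The variants come from the slide; the lookup scheme is an assumption.
VARIANT_LEXICON = {
    "werelt": "wereld",
    "weerelt": "wereld",
    "waereld": "wereld",
    "vvereld": "wereld",
    "werlt": "wereld",
}


def normalize(token: str) -> str:
    """Return the modern form for a known variant, else the token unchanged."""
    return VARIANT_LEXICON.get(token.lower(), token)


def expand_query(modern_form: str) -> set[str]:
    """Return the modern form plus every listed variant that maps to it."""
    variants = {v for v, target in VARIANT_LEXICON.items() if target == modern_form}
    return {modern_form} | variants
```

`normalize` would be applied to OCR output, while `expand_query` illustrates the reverse direction: a search for the modern word can be widened to cover the historical spellings.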
And a multitude of solutions!
22 different ‘tools’ from diverse developers:
OCR (C++, C#), Image Processing & Lexica (DLL), Command Line Tools (Win/Linux), Java, Ruby, PHP, Perl, etc. + 3rd party software!
“One ring to rule them all...”
→ IMPACT Interoperability Framework (IIF)
Main requirements
Behavioural:
- Minimize integration effort
- Minimize deployment effort
- Maximize usability
- Maximize scalability
Functional:
- Modular
- Transparent
- Expandable
- Open source
- Platform independent
Architecture
IMPACT Interoperability Framework: Technologies
- Java 6
- Generic Web Service Wrapper
- Apache Ant/Maven
- Apache Tomcat/httpd
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine
IMPACT Evaluation Framework: Dataset
- approx. 5 TB raw data (images, text files, metadata) and growing
- Ground truth transcriptions
- Evaluation modules
Components I: IIF
Enterprise Service Bus
Receives (SOAP) requests from users and distributes the load to the available worker nodes
Main effect: process parallelization, load distribution, failover
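The load distribution and failover behaviour described above can be illustrated with a small round-robin dispatcher. This is only a sketch of the general pattern, not the Apache Synapse configuration used in the project: the `Dispatcher` class and the demo workers `node_a`, `node_down` and `node_c` are hypothetical, and worker nodes are modelled as plain Python callables.

```python
from typing import Callable, Iterable

# A worker node is modelled as a callable taking a request and returning a result.
Worker = Callable[[str], str]


class Dispatcher:
    """Round-robin dispatcher that skips workers raising ConnectionError."""

    def __init__(self, workers: Iterable[Worker]):
        self._workers = list(workers)
        if not self._workers:
            raise ValueError("at least one worker node is required")
        self._next = 0  # index of the worker to try first on the next request

    def submit(self, request: str) -> str:
        count = len(self._workers)
        last_error = None
        for attempt in range(count):
            index = (self._next + attempt) % count
            try:
                result = self._workers[index](request)
            except ConnectionError as exc:
                last_error = exc  # failover: try the next node
                continue
            self._next = (index + 1) % count  # load distribution: rotate
            return result
        raise RuntimeError("all worker nodes failed") from last_error


# Hypothetical demo workers (assumptions for illustration only).
def node_a(request: str) -> str:
    return "a:" + request


def node_down(request: str) -> str:
    raise ConnectionError("node unreachable")


def node_c(request: str) -> str:
    return "c:" + request
```

With the workers `[node_a, node_down, node_c]`, consecutive requests rotate over the healthy nodes while the unreachable node is skipped.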
Framework integration
Easy-to-use generic command line wrapper (open source)
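The idea of a generic command line wrapper can be sketched as follows: a tool is described by a command template with named placeholders, and the wrapper fills in the parameters and runs the tool as a subprocess. This is an illustration under assumptions, not the IMPACT wrapper's actual interface: `run_tool`, `ToolResult` and `DEMO_TEMPLATE` are hypothetical names, and the demo "tool" is just the Python interpreter upper-casing its argument.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolResult:
    returncode: int
    stdout: str
    stderr: str


def run_tool(command_template: list[str], **params: str) -> ToolResult:
    """Fill the template's {placeholders} with params and run the command."""
    command = [part.format(**params) for part in command_template]
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return ToolResult(completed.returncode, completed.stdout, completed.stderr)


# Hypothetical tool description: any command line tool could be plugged in here.
DEMO_TEMPLATE = [
    sys.executable,
    "-c",
    "import sys; print(sys.argv[1].upper())",
    "{text}",
]
```

Because the command is passed as a list rather than a shell string, parameter values are not interpreted by a shell, which matters when file names come from user requests.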
Workflow development
OCR workflow = data pipeline
Building blocks = processing steps (nodes)
Integration = interaction between nodes (mashup)
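The "workflow = data pipeline" view above can be sketched as function composition: each node takes the page data and returns an updated copy, and a workflow is the chain of nodes. This is a simplified illustration, not a Taverna workflow; the nodes `binarize`, `recognize` and `postcorrect` are hypothetical stubs that only mimic real processing steps.

```python
from functools import reduce
from typing import Callable

# A node is a processing step: it takes page data and returns updated page data.
Node = Callable[[dict], dict]


def binarize(page: dict) -> dict:
    # Stub: a real node would call an image processing tool.
    return {**page, "image": page["image"] + "|binarized"}


def recognize(page: dict) -> dict:
    # Stub: a real node would call an OCR engine; the text is a fixed placeholder.
    return {**page, "text": "vvereld"}


def postcorrect(page: dict) -> dict:
    # Stub: a real node would consult a historical lexicon.
    return {**page, "text": page["text"].replace("vv", "w")}


def build_workflow(*nodes: Node) -> Node:
    """Chain the nodes so that each one receives the previous node's output."""
    def run(page: dict) -> dict:
        return reduce(lambda data, node: node(data), nodes, page)
    return run
```

Swapping one node for another, for instance a different OCR engine, leaves the rest of the workflow untouched as long as the node keeps the same input and output shape.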
Workflow management
- Web 2.0 style registry: myExperiment
- Local client: Taverna Workbench
- Web client: project website
- API: SOAP/REST
Community
Web 2.0 style workflow registry
Community of experts
Sharing of resources
Knowledge exchange
A central meeting point for users and researchers
Components II: Dataset
Database and front end, hosted at the PRIMA research group at University of Salford, School of Computing, United Kingdom
- more than 500,000 images from digital libraries
- more than 50,000 ground truth representations
- up to 10,000 direct access calls per month
- 4 TB of space and growing
Dataset
Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities
Evaluation features
- Text-based comparison of result with ground truth, using the Levenshtein distance method
- Layout-based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) framework
Example:
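As an illustration of the text-based comparison, the sketch below computes the Levenshtein distance between an OCR result and its ground truth and derives a character accuracy from it. It is not the project's evaluation module: the function names are hypothetical, and the accuracy formula (1 minus distance divided by ground truth length, floored at 0) is one common convention assumed here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Assumed convention: 1 - distance / len(ground_truth), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))
```

For the OCR result "werelt" against the ground truth "wereld", the distance is 1 (one substitution) over 6 characters.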
Ground-Truthing Tools
Aletheia
FineReader PAGE Exporter
GT Validator
GT Normalizer
Measures – Segmentation Errors
Ground Truth vs. Segmentation Result
- Merge
- Split
- Miss
- Partial Miss
- Misclassification
Region types shown: Paragraph, Caption
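The segmentation error types can be illustrated with a deliberately simplified sketch that compares axis-aligned bounding boxes. It is not the PAGE evaluation method, which works on arbitrary region shapes. The `Region` type and the detection rules below are assumptions for illustration, and result regions are assumed not to overlap each other.

```python
from typing import NamedTuple


class Region(NamedTuple):
    x0: int
    y0: int
    x1: int
    y1: int
    kind: str  # e.g. "paragraph" or "caption"


def area(r: Region) -> int:
    return max(0, r.x1 - r.x0) * max(0, r.y1 - r.y0)


def overlap(a: Region, b: Region) -> int:
    """Area of the intersection of two boxes (0 if they do not intersect)."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0


def segmentation_errors(ground_truth: list[Region], result: list[Region]) -> list[tuple[str, int]]:
    """Return (error type, region index) pairs under the simplified rules."""
    errors = []
    for i, gt in enumerate(ground_truth):
        hits = [r for r in result if overlap(gt, r) > 0]
        if not hits:
            errors.append(("miss", i))
            continue
        if sum(overlap(gt, r) for r in hits) < area(gt):
            errors.append(("partial miss", i))
        if len(hits) > 1:
            errors.append(("split", i))
        if any(r.kind != gt.kind for r in hits):
            errors.append(("misclassification", i))
    for j, r in enumerate(result):
        if sum(1 for gt in ground_truth if overlap(gt, r) > 0) > 1:
            errors.append(("merge", j))  # index refers to the result region
    return errors
```

A single result region covering both a paragraph and a caption is reported as a merge, and as a misclassification for whichever ground truth region has a different type.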
OCR Accuracy
Thank you! Questions?