co:op-read-convention marburg - günter mühlberger

Recognition and Enrichment of Archival Documents

Upload: icarus-international-centre-for-archival-research

Post on 13-Apr-2017

TRANSCRIPT

Page 1: co:op-READ-Convention Marburg - Günter Mühlberger

Recognition and Enrichment of Archival Documents

Page 2: co:op-READ-Convention Marburg - Günter Mühlberger

Facts and Figures
• READ
  • Recognition and Enrichment of Archival Documents
  • 13 partners, coordinated by the University of Innsbruck
  • 10 institutions as associated partners via a Memorandum of Understanding
  • Duration: 1 January 2016 to 30 June 2019
  • Grant: 8.2 million EUR
• Objectives
  • Applied research in pattern recognition and human language technology
  • Services for archives, humanities scholars, volunteers and computer scientists
  • Network building among those user groups

Page 3: co:op-READ-Convention Marburg - Günter Mühlberger

READ Consortium
READ Partners
• University of Innsbruck (co-ordinator)
• University of London
• Technical University Valencia
• Technical University Lausanne
• University College London
• University of Rostock
• National Centre for Scientific Research - Demokritos
• XEROX – European Research Centre
• Technical University Vienna
• University of Leipzig
• National Archive Finland
• Diocesan Archive Passau

Page 4: co:op-READ-Convention Marburg - Günter Mühlberger

READ MoU Partners
• Australian National Library
• Gottfried Wilhelm Leibniz Bibliothek
• National Library of Spain
• Centre virtuel de la connaissance sur l'Europe, Digital Humanities Lab (Luxembourg)
• The Linnean Society of London
• The Hessian State Archive Marburg (Germany)
• The Munch Museum (Norway)
• The Civic Archives of Bozen-Bolzano (Italy)
• Music and Instrument Museum Leipzig
• The University and Research Library Erfurt/Gotha (Germany)
• Friedrich-Alexander-Universität Erlangen-Nürnberg
• PLANET GmbH (Germany)

Page 5: co:op-READ-Convention Marburg - Günter Mühlberger

What will remain once the project has finished its work in June 2019?

Page 6: co:op-READ-Convention Marburg - Günter Mühlberger

Publications
• H2020 Grant Agreement
  • Article 29.2 Open access to scientific publications
    • Each beneficiary must ensure open access (free of charge online access for any user) to all peer-reviewed scientific publications relating to its results.
• Open Access
  • Golden way
    • E.g. FrontiersIn from EPFL (Technical University Lausanne)
  • Green way
• Key Performance Indicator
  • 15-25 scientific publications per year

Page 7: co:op-READ-Convention Marburg - Günter Mühlberger

Research Data
• H2020 Grant Agreement
  • Article 29.3 Open access to research data
    • Regarding the digital research data generated in the action (‘data’), the beneficiaries must:
      • (a) deposit in a research data repository and take measures to make it possible for third parties to access, mine, exploit, reproduce and disseminate (free of charge for any user) the following:
        • (i) the data, including associated metadata, needed to validate the results presented in scientific publications as soon as possible;

Page 8: co:op-READ-Convention Marburg - Günter Mühlberger

What are Research Data in READ?
• Images and corresponding Reference Data (= ground truth)
  • Images = raw material
  • Reference data or ground truth = the expected, perfect output
  • Data = what is actually produced by an algorithm/tool
• Example
  • Image of a page
  • Correct text of a page = reference data
  • Data = the text produced by an HTR engine
  • Difference between expected result and actual result = the result of a scientific experiment, e.g. measured as Word Error Rate
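The Word Error Rate mentioned in the example can be illustrated with a minimal sketch. It assumes the common definition (word-level edit distance between reference text and HTR output, divided by the number of reference words). The function name and the sample sentences are made up for illustration and are not taken from the READ tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical example: one substituted word and one inserted word.
reference_text = "the quick brown fox jumps"   # reference data (ground truth)
htr_output = "the quick brown fax jumps over"  # data produced by an HTR engine
print(f"WER: {word_error_rate(reference_text, htr_output):.2f}")  # prints "WER: 0.40"
```

Because insertions count as errors, the value can exceed 1.0 when the engine outputs many more words than the reference contains.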

Page 9: co:op-READ-Convention Marburg - Günter Mühlberger

Research Data are used for…
• Evaluation
  • Difference between expected and actual result
• Problem description / requirements specification
  • What do we actually expect from an algorithm or tool?
  • Simple with HTR, but becomes much more complicated with Layout Analysis
  • E.g. do we need the whole text of a page, or maybe just person names within one column of a table? Such questions need to be defined and need to be reflected in the design of the reference data
• Machine learning (training data)
  • Machine learning tools need training data
  • Reference data are the basis for this training process

Page 10: co:op-READ-Convention Marburg - Günter Mühlberger

Research Data in READ
• Key Performance Indicator
  • 3 million images with at least 50,000 pages of reference data at the end of the project
• Why such a large amount?
  • Our objective is that the READ dataset is “somehow” representative of many document types in archives, and of writing and layout styles from several centuries and languages
  • We are therefore very much interested in any kind of digitised document collection
  • Progress in computer science is strongly connected to the availability of large data sets

Page 11: co:op-READ-Convention Marburg - Günter Mühlberger

Research Data for Competitions
• Key Performance Indicator
  • READ will organise several research competitions at various conferences
• Competitions
  • Nowadays a popular way to measure the progress of research in a specific field, e.g. line detection, text recognition, or writer retrieval…
• Evaluation of competition results
  • Depends on the availability of reference data
• Attractiveness of competitions
  • Dependent on the challenge itself, but also on the size of the dataset and the quality of reference data
  • 160,000 EUR are foreseen as sub-contracts for the production of reference data

Page 12: co:op-READ-Convention Marburg - Günter Mühlberger

Research Data in READ
• Images will be connected with reference data such as:
  • Correct text (e.g. on page or line level)
  • Correct writer attribution (e.g. letters with names of writers)
  • Correct person names on page level
  • Correct layout elements, e.g. text lines, text blocks, tables, or forms
  • Detailed descriptions of tables or forms
  • Everything which is interesting for archives, scholars, the public!
• Data will be made available e.g. via ZENODO or other Research Data Platforms
• Archives are encouraged to provide their collections!
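To make the kinds of reference data listed on this slide more concrete, here is a small sketch of how one page of reference data might be represented. All class names, field names and sample values are hypothetical. The slides do not specify the actual storage format used in READ or Transkribus.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TextLine:
    """One text line: a layout element plus its correct text."""
    baseline: List[Tuple[int, int]]  # assumed: polyline in pixel coordinates
    text: str


@dataclass
class PageReference:
    """Reference data (ground truth) attached to one page image."""
    image_file: str
    lines: List[TextLine] = field(default_factory=list)
    writer: Optional[str] = None                            # writer attribution
    person_names: List[str] = field(default_factory=list)   # names on page level

    def full_text(self) -> str:
        """Correct text on page level, assembled from the line level."""
        return "\n".join(line.text for line in self.lines)


# Invented sample record, for illustration only.
page = PageReference(
    image_file="page_0001.jpg",
    lines=[
        TextLine([(100, 200), (900, 205)], "Anno 1700 den 3. Martii"),
        TextLine([(100, 260), (880, 262)], "Johann Huber"),
    ],
    writer="Scribe A",
    person_names=["Johann Huber"],
)
print(page.full_text())
```

Keeping text on line level and deriving the page text from it means the same record can serve both line-level and page-level evaluation.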

Page 13: co:op-READ-Convention Marburg - Günter Mühlberger

Open Source Software
• Release as OS
  • Not an obligation of the Grant Agreement, but of the specific e-Infrastructure call of the EU
  • Foreseen for (nearly) all software tools in the project
  • During 2016 we will take the first steps and move parts of the software to GITHUB or a similar platform
• Advantage
  • Many tools are research tools and therefore “not easy” to implement
  • The implementation in Transkribus will allow users to try out the tools beforehand

Page 14: co:op-READ-Convention Marburg - Günter Mühlberger

Interim summary
• Open Access to publications
  • E.g. via Open Access publishers
• Open Research Data (images and reference data)
  • E.g. via repositories such as ZENODO (run by the CERN Data service)
• Open Source for the software tools
  • E.g. via open software repositories such as GITHUB

An (expert) user will have “everything together” to dive deeper into the results of the project

Page 15: co:op-READ-Convention Marburg - Günter Mühlberger

Open Platform

Build a platform which provides recognition, transcription and enrichment of historical documents as a general infrastructure for archives, libraries, humanities scholars, volunteers, the public – and computer scientists.

Page 16: co:op-READ-Convention Marburg - Günter Mühlberger

Why a Platform? (1)
• Software as a Service (SaaS)
  • Implementation of the full range of tools from READ requires a lot of work and know-how
  • The entrance hurdle for archives and humanities scholars is much lower since the services can be accessed and used via the Internet
  • E.g. users are free to upload their documents, to run tests and to further decide which services they want to use
• Machine Learning
  • Most tools require large amounts of training data
  • The more data are available in the platform, the higher the chance to improve accuracy
  • E.g. if a user in Greifswald transcribes a German text from 1700, these data may also be used to train the HTR engine for a user in Bavaria. Or in the US.

Page 17: co:op-READ-Convention Marburg - Günter Mühlberger

Why a Platform? (2)
• Cooperation
  • Successful digitisation projects need collaboration between content holders, scholars, computer scientists and volunteers
  • Platform serves as a mediation tool between these stakeholder groups
  • E.g. they can define requirements, produce reference data, implement new services, edit and correct results in a shared manner
• Standardisation
  • Full benefits of technology can only be enjoyed if a large variety of standards is obeyed
  • De-facto standardisation by using the same platform and tools
  • E.g. the real benefit of digital editions will be enjoyed once they are centrally accessible

Page 18: co:op-READ-Convention Marburg - Günter Mühlberger

Service Platform
• READ Service Platform = Transkribus
  • We are obliged to run the service platform from the very first day of the project
  • We are also obliged to provide a business plan in month 12
  • And to implement this business plan after month 12
• Final objective
  • To run and maintain the service platform also after the end of the project
  • A business model needs to be developed
• General approach
  • Service levels
  • To provide free services for everyone; service fees will be applied only if certain limits are exceeded

Page 19: co:op-READ-Convention Marburg - Günter Mühlberger

Overview of tools and services
• Handwritten Text Recognition
  • HTR based on HMM (Hidden Markov Models) and on NN (neural networks)
• Keyword Spotting
  • Query by Example
  • Query by String
• Image Preprocessing
  • Binarisation, Enhancement
• Layout Analysis
  • Basic analysis of words, lines, region types (text, graphical,…)
• Table and Forms Recognition
  • Generic and template based recognition

Page 20: co:op-READ-Convention Marburg - Günter Mühlberger

Overview of tools and services
• Document Understanding
  • Columns, marginalia, date, etc.
• Automatic Writer Identification and Retrieval
  • Training and retrieval of specific writers/writing styles
• Language Toolkit
  • Adaptation of language resources to support HTR
• Text2Image matching
  • Matching existing text with images
• E-Learning module
  • Online training tool for students and volunteers to practise deciphering of handwritten documents
• ScanApp
  • The mobile phone as document scanner with direct connection to the Transkribus service platform
• And many more…
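The idea behind Text2Image matching (linking an already existing transcription to the page images) can be sketched in a strongly simplified form. The sketch assumes that an HTR engine has already produced a noisy text for every detected line, and it pairs each of those with the most similar line of the existing transcript using Python's standard difflib module. This is an illustrative stand-in, not the method used in READ. The function name, threshold and sample lines are invented.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def match_lines(transcript_lines: List[str],
                recognised_lines: List[str],
                threshold: float = 0.6) -> List[Optional[int]]:
    """For each recognised line, return the index of the most similar
    transcript line, or None if no line reaches the similarity threshold."""
    matches: List[Optional[int]] = []
    for recognised in recognised_lines:
        best_index, best_score = None, threshold
        for index, known in enumerate(transcript_lines):
            score = SequenceMatcher(None, known.lower(), recognised.lower()).ratio()
            if score >= best_score:
                best_index, best_score = index, score
        matches.append(best_index)
    return matches


# Invented example: an existing transcript and noisy recogniser output
# for three detected line images (the last one is not in the transcript).
transcript = ["In the name of God amen", "I give to my son John", "my house and lands"]
recognised = ["ln the narne of God arnen", "rny house and 1ands", "zzzz qqqq"]
print(match_lines(transcript, recognised))  # prints "[0, 2, None]"
```

Each matched pair links a line image to its correct text, which is the kind of reference data needed for training and evaluation.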

Page 21: co:op-READ-Convention Marburg - Günter Mühlberger

READ Platform: http://transkribus.eu/

READ Website: http://read.transkribus.eu/ (coming soon)

User’s guide: http://transkribus.eu/wiki/

Thank you very much for your attention!