July 2004 – METS Opening Day UK – www.ccs-gmbh.de
docWORKS/METAe
The Engine for Automated Metadata Extraction and XML Tagging
Claus Gravenhorst
Content Conversion Specialists
CCS – Offices
What is docWORKS/METAe?
Production tool for conversion of printed documents into fully tagged digital objects
The METAe edition of docWORKS is the result of the EU-funded project METAe
Start of project: September 2000
End of project: August 2003
Product launch: March 2003, CeBIT exhibition
The project group
1. Leopold-Franzens-Universität Innsbruck (Co-ordinator), Austria
2. Universität Linz, Institut für Angewandte Informatik, University of Linz, Austria
3. Mitcom Neue Medien GmbH (ABBYY Europe), Germany
4. CCS Compact Computer Systeme, Germany
5. Universidad de Alicante, Spain
6. Friedrich-Ebert-Stiftung, Germany
7. Cornell University Library. Department of Preservation and Conservation, USA
8. Bibliothèque nationale de France
9. The National Library of Norway, Rana division, Norway
10. Biblioteca Statale A. Baldini, Italy
11. Dipartimento di Sistemi e Informatica, University of Florence, Italy
12. Karl-Franzens-Universität Graz, Universitätsbibliothek, Austria
13. Scuola Normale Superiore, Centro di Ricerche Informatiche per i Beni Culturali, Italy
14. Higher Education Digitisation Service HEDS, UK
Challenges
Digitization and retro-conversion of printed or textual material is becoming increasingly important:
Keep knowledge and cultural heritage alive
Preserve the original
Enable quick and enhanced access through highly structured documents
Open up new dimensions of research
Provide standardized output formats
Goals
Automate the conversion process
Make digitization more effective and safer
Increase the added value of digitized collections
Provide a standardized output format in order to allow transformation of metadata into various applications and systems
docWORKS – System Overview
Input: document (Scanning); TIFF, JPEG (Import)
docWORKS engine: Image Pre-Processing, Layout Analysis, Character Recognition, Structural Analysis, Correction
RulesDB
Output (Export): METS/ALTO, METS/TEI, PDF
docWORKS – recording as much metadata as possible!
Available data: library records (e.g. MARC), TIFF images
Descriptive metadata
- Format: METS with DC or MODS, linking to the catalogue record
- docWORKS engine: imports subsets and links to the record; creates descriptive records for articles, pictures, …
- User mode: automated (import); semi-automated, correction recommended (descriptive records)
Administrative metadata
- Format: METS incl. NISO (mix)
- docWORKS engine: records metadata
- User mode: fully automated after defining a profile
Structural metadata, logical
- Format: METS structural map
- docWORKS engine: suggests labels of logical elements and structures
- User mode: automated, correction recommended
Structural metadata, physical
- Format: ALTO (Analyzed Layout and Text Object)
- docWORKS engine: provides a suggestion for the physical structure
- User mode: automated, correction in special cases
docWORKS – Matching of Image Files and Page Numbers
Image file    Pagination                Page number
000001.tif    Not counted               Np
000002.tif    Not counted               Np
000003.tif    Counted                   I
000004.tif    Counted                   II
000005.tif    Counted                   III
000006.tif    Counted                   IV
000007.tif    Counted                   V
000008.tif    Counted                   VI
000009.tif    Counted                   1
000010.tif    Counted, not paginated    (2)
000011.tif    Counted                   3
000012.tif    Counted                   4
placeholder   Missing page              5
placeholder   Missing page              6
000013.tif    Counted                   7
000014.tif    Counted                   8
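The matching shown in the table can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the docWORKS implementation: the `Page` class, its fields, and `label_pages` are hypothetical names, and the sketch assumes the operator (or an earlier analysis step) has already decided which images are counted, which carry a printed number, and where pages are missing.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def to_roman(n: int) -> str:
    """Convert a positive integer to an upper-case Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


@dataclass
class Page:
    image: Optional[str]      # None marks a missing page (placeholder)
    counted: bool = True      # False: page is outside any page count
    printed: bool = True      # False: counted, but no number printed on it
    scheme: str = "arabic"    # "roman" for front matter, else "arabic"


def label_pages(pages: List[Page]) -> List[Tuple[str, str, str]]:
    """Return (image file, pagination, page number) rows in scan order."""
    counters = {"roman": 0, "arabic": 0}
    rows = []
    for p in pages:
        if not p.counted:
            rows.append((p.image, "Not counted", "Np"))
            continue
        counters[p.scheme] += 1
        n = counters[p.scheme]
        label = to_roman(n) if p.scheme == "roman" else str(n)
        if p.image is None:
            rows.append(("placeholder", "Missing page", label))
        elif not p.printed:
            rows.append((p.image, "Counted, not paginated", f"({label})"))
        else:
            rows.append((p.image, "Counted", label))
    return rows


# The sequence from the table above.
pages = (
    [Page(f"{i:06d}.tif", counted=False) for i in (1, 2)]
    + [Page(f"{i:06d}.tif", scheme="roman") for i in range(3, 9)]
    + [Page("000009.tif"), Page("000010.tif", printed=False),
       Page("000011.tif"), Page("000012.tif"),
       Page(None), Page(None),
       Page("000013.tif"), Page("000014.tif")]
)
rows = label_pages(pages)
for row in rows:
    print("{:<13} {:<25} {}".format(*row))
```

Placeholders advance the page counter without consuming an image file, which is why 000013.tif ends up as page 7.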
docWORKS – Structural Analysis
FRONT
MAIN
BACK
docWORKS – Structural Analysis
Chapter 1
Chapter 2
Subchapter 1 Subchapter 2
docWORKS – Structural Analysis
Title page
Statement page
Preface
Table of contents
docWORKS – Document layers
Document layers are differentiated automatically. Restricting a search to certain layers enables well-directed searches, and the electronic text can be presented without unnecessary items.
- Body text, independent of its presentation
- Margin notes, footnotes
- Pictures and captions
- Advertisements
- Annexes and supplements
- Navigation layer: table of contents, running title, document index, page number, volume index
- Book: separation of "intellectual" and "artificial" content
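A small Python sketch may help to show why layer separation matters for search and presentation. It assumes that every recognized block already carries a layer name; the `Block` class, the layer names, and the sample strings are illustrative assumptions, not docWORKS output.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Block:
    layer: str   # hypothetical layer name, e.g. "body" or "running_title"
    text: str


def select_layers(blocks: Iterable[Block], wanted: Set[str]) -> List[str]:
    """Keep only the text of blocks that belong to the wanted layers."""
    return [b.text for b in blocks if b.layer in wanted]


# One page in reading order, with navigation items mixed into the body text.
page = [
    Block("running_title", "CHAPTER I"),
    Block("body", "Those who have read the History of Columbus"),
    Block("page_number", "7"),
    Block("body", "will, doubtless, remember the character"),
]

# Presentation or full-text search over the body layer only.
reading_text = " ".join(select_layers(page, {"body"}))
print(reading_text)
```

Without the layer information, the running title and page number would interrupt the sentence in the extracted text.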
docWORKS – Digitization of books and journals (METAe)
docWORKS – Digitization of scientific documents
docWORKS – Manual editing of descriptive metadata / volume
docWORKS – Manual editing of descriptive metadata / illustration
docWORKS – Basic Workflow
Digitization: Scanning
DB, OPAC: MARC
Quality Control, Images, Conversion, Quality Control
Output: Export, Presentation
XML/METS, PDF
docWORKS – Scalable Client / Server architecture
Servers: Server 1, Server 2, Server 3, …, Server n
Clients: Scan, Import, Quality Control
Server tasks: Auto-Import, Image Preprocessing, Layout Analysis, OCR, Structural Analysis, Export
docWORKS – METS / ALTO
METS document
TIFF, ALTO
ALTO – Analyzed Layout and Text Object
docWORKS – METS
- Header
- MODS or DC: descriptive metadata
- NISO Z39.87 (MIX): technical metadata
- Structural Map: Physical Structure
- Structural Map: Logical Structure
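The sections listed above correspond to top-level elements of the METS schema. The following Python sketch builds an empty skeleton with the standard library to show how they sit side by side. It is a simplified illustration, not docWORKS export code: the IDs are made up, the metadata wrappers are left empty, and the result is not meant to validate against the schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def m(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton() -> ET.Element:
    mets = ET.Element(m("mets"))
    ET.SubElement(mets, m("metsHdr"))                      # header

    dmd = ET.SubElement(mets, m("dmdSec"), ID="DMD1")      # descriptive metadata
    ET.SubElement(dmd, m("mdWrap"), MDTYPE="MODS")         # or MDTYPE="DC"

    amd = ET.SubElement(mets, m("amdSec"), ID="AMD1")      # administrative metadata
    tech = ET.SubElement(amd, m("techMD"), ID="TECH1")
    ET.SubElement(tech, m("mdWrap"), MDTYPE="NISOIMG")     # NISO image metadata (MIX)

    ET.SubElement(mets, m("fileSec"))                      # image and ALTO files

    for map_type in ("PHYSICAL", "LOGICAL"):               # two structural maps
        struct_map = ET.SubElement(mets, m("structMap"), TYPE=map_type)
        ET.SubElement(struct_map, m("div"))
    return mets


mets = build_mets_skeleton()
section_names = [child.tag.split("}")[1] for child in mets]
print(section_names)
```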
docWORKS – ALTO
Styles
- Paragraph (alignment, line spacing, etc.)
- Font (name, size, bold, italic, etc.)
Layout
- Printspace
- TopMargin
- InnerMargin
- OuterMargin
- BottomMargin
Objects in the 5 areas above:
- Text block
- Text lines
- Strings [coordinates, string (as printed), substitution (hyphenation)]
- Spaces
- Composed block
- Picture
- Table
- Formula
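To illustrate how strings, spaces, and the hyphenation substitution work together, the Python sketch below reads a tiny ALTO-like fragment and rebuilds the running text of a text block. The fragment is hand-written for this example and simplified (no namespace, no styles, made-up coordinates and IDs); the attribute names `CONTENT`, `SUBS_TYPE`, and `SUBS_CONTENT` follow the published ALTO schema.

```python
import xml.etree.ElementTree as ET

# Hand-written sample: "metadata" is hyphenated across two text lines.
ALTO_SAMPLE = """
<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Automated" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
            <SP/>
            <String CONTENT="meta" HPOS="420" VPOS="200" WIDTH="120" HEIGHT="40"
                    SUBS_TYPE="HypPart1" SUBS_CONTENT="metadata"/>
            <HYP CONTENT="-"/>
          </TextLine>
          <TextLine>
            <String CONTENT="data" HPOS="100" VPOS="250" WIDTH="110" HEIGHT="40"
                    SUBS_TYPE="HypPart2" SUBS_CONTENT="metadata"/>
            <SP/>
            <String CONTENT="extraction" HPOS="230" VPOS="250" WIDTH="280" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def block_text(block: ET.Element) -> str:
    """Join the strings of a text block, resolving hyphenated words."""
    words = []
    for string in block.iter("String"):
        subs_type = string.get("SUBS_TYPE")
        if subs_type == "HypPart1":
            words.append(string.get("SUBS_CONTENT"))   # use the full word once
        elif subs_type == "HypPart2":
            continue                                    # already emitted with part 1
        else:
            words.append(string.get("CONTENT"))         # string as printed
    return " ".join(words)


root = ET.fromstring(ALTO_SAMPLE)
text = block_text(root.find(".//TextBlock"))
print(text)
```

Keeping both the printed string and the substitution means the page image can still be highlighted word by word, while full-text search sees the unbroken word.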
docWORKS – METS / physical structure
METS sections: DC, FILEGRP, PHYS, LOGICAL
PHYS page attributes:
ORDER: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
LABEL: II, III, IV, V, VI, 2, 3, 4, 5, 6, …
ORDERLABEL: I, II, III, IV, V, VI, 1, 2, 3, 4, 5, 6, …
docWORKS – METS / physical structure
METS sections: DC, FILEGRP, PHYS, LOGICAL
PHYS: DIV(page) with par of two file pointers:
- fptr: FILE ID of the ALTO file
- fptr: FILE ID of the IMAGE file
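The page entry of the physical structure can be sketched as follows. This is an assumption-laden illustration: the function name, the file IDs, and the `TYPE` value are invented, and the namespace is omitted for brevity. The published METS schema nests `par` inside `fptr` and points at files through `area` elements, so the sketch follows that arrangement rather than the exact drawing on the slide.

```python
import xml.etree.ElementTree as ET


def page_div(order: int, orderlabel: str,
             image_file_id: str, alto_file_id: str) -> ET.Element:
    """Build one physical page div that points at its image and ALTO file."""
    div = ET.Element("div", TYPE="PAGE", ORDER=str(order), ORDERLABEL=orderlabel)
    fptr = ET.SubElement(div, "fptr")
    par = ET.SubElement(fptr, "par")                  # both files show the same page
    ET.SubElement(par, "area", FILEID=image_file_id)  # scanned image
    ET.SubElement(par, "area", FILEID=alto_file_id)   # layout and text
    return div


# Hypothetical IDs for the ninth scanned image, printed page number 1.
div = page_div(9, "1", "IMG00009", "ALTO00009")
print(ET.tostring(div, encoding="unicode"))
```

The `par` grouping expresses that the image and the ALTO file are parallel representations of the same page.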
docWORKS – METS / logical structure
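METS sections: DC, FILEGRP, PHYS, LOGICAL
LOGICAL: nested divisions with their descriptive metadata
- DIV(volume): DCMD_PHYS, DCMD_ELEC
- DIV(issue): DCMD_ISSUE#
- DIV(contrib.): DCMD_#CONT#
- DIV(chapter): DCMD_CHAP#
- DIV(paragraph)
DIV(paragraph) with seq of two file pointers:
- fptr: FILEID of an ALTO file, BEGIN at a text block (coordinates)
- fptr: FILEID of an ALTO file, BEGIN at a text block (coordinates)
XSLT: renders the referenced text blocks as running text
Sample text: Those who have read the History of Columbus will, doubtless, remember the character and exploits ...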
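A logical division that points into ALTO text blocks can be sketched in Python as below. The function name, the file IDs, and the block IDs are hypothetical, and the namespace is omitted; `seq`, `area`, `FILEID`, `BETYPE`, and `BEGIN` are element and attribute names from the METS schema. The example assumes a paragraph that starts on one page and continues on the next.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def paragraph_div(block_refs: List[Tuple[str, str]]) -> ET.Element:
    """Build a logical paragraph div from (ALTO file ID, text block ID) pairs.

    The pairs must be given in reading order; the seq element preserves it.
    """
    div = ET.Element("div", TYPE="PARAGRAPH")
    fptr = ET.SubElement(div, "fptr")
    seq = ET.SubElement(fptr, "seq")
    for file_id, block_id in block_refs:
        # BEGIN names the ID of the text block inside the referenced ALTO file.
        ET.SubElement(seq, "area", FILEID=file_id, BETYPE="IDREF", BEGIN=block_id)
    return div


# Hypothetical references: last block of one page, first block of the next.
para = paragraph_div([("ALTO00013", "TB7"), ("ALTO00014", "TB1")])
begins = [(a.get("FILEID"), a.get("BEGIN")) for a in para.iter("area")]
print(ET.tostring(para, encoding="unicode"))
```

A stylesheet or viewer can follow these references in order to assemble the paragraph text, independent of where the page breaks fall.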
docWORKS – ALTO / page layout and text content
docWORKS – ALTO / hyphenated word
docWORKS – Workshop UK 2004
University Library of Southampton, September 28/29, free of charge
1st day: Product information; Output, metadata standards; Workflow, use cases
2nd day: "Hands-on" session, working with your own samples; Individual consultancy sessions
Contact: Simon Brackenbury - [email protected]; Hartmut Janczikowski - [email protected]
Thank you!
Claus [email protected]
Content Conversion Specialists www.ccs-gmbh.de
http://meta-e.uibk.ac.at/