Heuristic Approach for Automatic
Metadata Capture of E-books/Journals
ARD Prasad
DRTC, Indian Statistical Institute, Bangalore
Agenda
Earlier Experiment with Printed Books
Present Experiment with E-Books & E-Journals
Heuristics for Printed Books
Heuristics for the ...
Title page
Verso of the title page
Methodology for Printed Books
Scan the title page
OCR the image
Generate the output in HTML
Apply heuristics to the HTML pages
Identify the bibliographic elements
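The "apply heuristics to the HTML pages" step needs the text of each line together with its font size. A minimal sketch of that extraction follows, assuming the OCR tool writes font sizes as inline `font-size` styles; the class name `FontRunParser`, the function `extract_runs`, and the 12-point default are hypothetical and not taken from the original experiment.

```python
from html.parser import HTMLParser
import re

# Tags that never have a closing tag, so they must not affect the size stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

class FontRunParser(HTMLParser):
    """Collect (text, font_size) runs from OCR-generated HTML."""

    def __init__(self):
        super().__init__()
        self.size_stack = [12.0]  # assumed default size when none is given
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        match = re.search(r"font-size:\s*([\d.]+)", style)
        # Inherit the enclosing size when the tag does not set its own.
        size = float(match.group(1)) if match else self.size_stack[-1]
        self.size_stack.append(size)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if len(self.size_stack) > 1:
            self.size_stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.runs.append((text, self.size_stack[-1]))

def extract_runs(html):
    """Return the text runs of a page in document order with their sizes."""
    parser = FontRunParser()
    parser.feed(html)
    return parser.runs

page = ('<p style="font-size:24pt">Digital Libraries</p>'
        '<p style="font-size:12pt">A. Author</p>')
runs = extract_runs(page)
```

The resulting list of (text, size) pairs, in page order, is the kind of input the title and author heuristics can work on.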
Heuristics for Verso of the Title Page
Identify date, edition, etc.
Check whether prenatal cataloguing (cataloguing-in-publication) data is available
Identify the bibliographic elements in the prenatal cataloguing data
Cross-check them against the identifications from the title page
Resolve any conflicts
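As an illustration of the verso steps, the sketch below pulls a year and an edition statement out of OCR text and flags disagreements with the title page. The regular expressions, the rule of taking the latest year, and the policy of keeping the title-page value while flagging the conflict are all assumptions made for this example; the slides do not say how conflicts were actually resolved.

```python
import re

# Four-digit years from 1500 to 2099 (assumed plausible range).
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")
# Edition statements such as "2nd ed." or "Second edition".
EDITION_RE = re.compile(
    r"\b(\d+(?:st|nd|rd|th)|first|second|third|fourth|fifth|revised)"
    r"\s+ed(?:ition|\.)",
    re.IGNORECASE,
)

def find_year(text):
    """Return the latest year mentioned, or None (assumed rule)."""
    years = [int(y) for y in YEAR_RE.findall(text)]
    return max(years) if years else None

def find_edition(text):
    """Return the first edition designation found, lower-cased, or None."""
    match = EDITION_RE.search(text)
    return match.group(1).lower() if match else None

def reconcile(title_page_value, verso_value):
    """Return (value, conflict_flag) for one bibliographic element."""
    if title_page_value is None:
        return verso_value, False
    if verso_value is None:
        return title_page_value, False
    same = title_page_value.strip().lower() == verso_value.strip().lower()
    # Assumed policy: keep the title-page value and flag the disagreement.
    return title_page_value, not same

verso = "First published 1998. Second edition 2004. Copyright 2004"
year = find_year(verso)
edition = find_edition(verso)
```

A flagged conflict could then be passed to a cataloguer for manual review instead of being settled automatically.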
Generating Bibliographic Records
Once the bibliographic elements are identified, generate bibliographic records in:
ISO 2709
Dublin Core
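For the Dublin Core output, a minimal sketch using Python's standard `xml.etree.ElementTree` is shown below. The `record` wrapper element and the function name `to_dublin_core` are hypothetical; only the `dc` element names and the namespace URI come from the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialise with the usual "dc:" prefix

def to_dublin_core(fields):
    """Serialise a dict of identified elements as simple Dublin Core XML.

    A list value produces one repeated element per entry.
    """
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = str(item)
    return ET.tostring(root, encoding="unicode")

xml_record = to_dublin_core({
    "title": "Sample Title",
    "creator": ["First Author", "Second Author"],
    "publisher": "Sample Publisher",
    "date": "2004",
})
```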
Sample Heuristics for Identifying Title
Order of the bibliographic elements
Titles are found in the upper or upper-middle portion of the title page.
The title appears first on the title page (75.15 per cent); in a few cases the author or series occupies the first position.
The title is set in the largest font on the page (94.99 per cent), compared with the font sizes of the other fields.
If the title and subtitle occur on the same line, they are separated by “:” (colon) or “-” (hyphen).
A title need not consist only of alphabetic characters. The title string may contain numerals and punctuation marks such as commas and hyphens.
Titles usually contain terms such as “The”, “An”, “Introduction”, “Theory”, “in”, “to”.
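The title heuristics above can be combined into a simple scoring function. The sketch below is one possible reading, not the scoring used in the experiment: the weights (3, 2, 1) and the function names `pick_title` and `split_subtitle` are hypothetical, and the input is assumed to be a list of (text, font_size) pairs in top-to-bottom page order.

```python
import re

# Cue words listed in the heuristics.
CUE_WORDS = {"the", "an", "introduction", "theory", "in", "to"}

def pick_title(lines):
    """Pick the most title-like line from (text, font_size) pairs."""
    max_size = max(size for _, size in lines)
    best, best_score = None, float("-inf")
    for position, (text, size) in enumerate(lines):
        score = 0.0
        if size == max_size:   # largest font: strongest signal
            score += 3.0
        if position == 0:      # first element on the page
            score += 2.0
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
        if words & CUE_WORDS:  # contains a typical title word
            score += 1.0
        if score > best_score:
            best, best_score = text, score
    return best

def split_subtitle(title):
    """Split 'Title: Subtitle' or 'Title - Subtitle' into two parts."""
    parts = re.split(r"\s*:\s*|\s+-\s+", title, maxsplit=1)
    main = parts[0].strip()
    sub = parts[1].strip() if len(parts) > 1 else None
    return main, sub

page_lines = [
    ("Series in Library Science", 10.0),
    ("Introduction to Cataloguing: Theory and Practice", 28.0),
    ("A. Author", 14.0),
]
title, subtitle = split_subtitle(pick_title(page_lines))
```

The hyphen rule only splits on a hyphen surrounded by spaces, so hyphenated words such as "E-Books" stay intact.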
Heuristics for other elements
Subtitles
Edition
Volume
Authors/Contributors
Publisher
Place
Year
Series
Present Experiment
E-Books (from sites like amazon.com)
E-Journals (non-OAI-compliant)
Methodology
Template-based identification
Heuristic-based identification
Disadvantages of Template Based Approach
Templates must be created for every new site
A site may change its appearance, so more than one template may be needed for each site or journal
Methodology
Study a few sites to develop heuristics
Use a web crawler to probe the site
Identify the files containing documents (filter out irrelevant files)
Apply heuristics to the files containing e-documents
Generate Dublin Core records
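The "identify the files having documents" step can be sketched as a link filter applied to each crawled page. This example assumes that documents are recognisable by file extension; the extension set, the names `LinkParser` and `document_links`, and the example.org URL are hypothetical, and fetching the pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumed extensions for document files; everything else is filtered out.
DOC_EXTENSIONS = {".pdf", ".ps", ".doc"}

class LinkParser(HTMLParser):
    """Collect the href of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def document_links(page_url, html):
    """Return absolute URLs of likely document files, without duplicates."""
    parser = LinkParser()
    parser.feed(html)
    found = []
    for href in parser.links:
        url = urljoin(page_url, href)  # resolve relative links
        extension = posixpath.splitext(urlparse(url).path)[1].lower()
        if extension in DOC_EXTENSIONS and url not in found:
            found.append(url)
    return found

sample_page = ('<a href="vol1/paper1.pdf">Paper 1</a>'
               '<a href="/about.html">About</a>'
               '<a href="logo.gif">Logo</a>'
               '<a href="vol1/paper1.pdf">Paper 1 again</a>')
docs = document_links("http://example.org/journal/", sample_page)
```

The surviving URLs would then go through the metadata heuristics and the Dublin Core generation step.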
Thank You
Welcome to International Conference on
Semantic Web & Digital Libraries
21st – 23rd February, 2007
Indian Statistical Institute
Bangalore