Heuristic Approach for Automatic Metadata Capture of E-books/Journals

ARD Prasad
DRTC, Indian Statistical Institute, Bangalore

Agenda

- Earlier Experiment with Printed Books
- Present Experiment with E-Books & E-Journals

Heuristics for Printed Books

Heuristics for the ...
- Title page
- Verso of the title page

Methodology for Printed Books

- Scan the title page
- OCR the image
- Generate the output in HTML
- Apply heuristics to the HTML pages
- Identify the bibliographic elements

Heuristics for Verso of the Title Page

- Identify date, edition, etc.
- See whether prenatal cataloging is available
- Identify the bibliographic elements in the prenatal catalog
- Counter-check the identifications from the title page
- Resolution in case of conflicts
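The counter-check and conflict-resolution steps could be sketched as below. The slides do not say which source wins when the two disagree, so the policy here (prefer the prenatal catalog entry) is only an assumption, and the function name `reconcile` and the field names are hypothetical.

```python
def reconcile(title_page: dict, prenatal: dict) -> tuple[dict, list]:
    """Merge two element->value mappings and list the fields that conflict."""
    merged, conflicts = {}, []
    for field in sorted(set(title_page) | set(prenatal)):
        tp_val, pn_val = title_page.get(field), prenatal.get(field)
        differs = (
            tp_val and pn_val
            and tp_val.strip().lower() != pn_val.strip().lower()
        )
        if differs:
            conflicts.append(field)
            merged[field] = pn_val          # assumed policy: trust the prenatal catalog
        else:
            merged[field] = pn_val or tp_val
    return merged, conflicts


merged, conflicts = reconcile(
    {"title": "Data Structures", "year": "1998"},
    {"title": "Data Structures", "year": "1999", "edition": "2nd"},
)
```

Returning the list of conflicting fields alongside the merged record leaves room for a different resolution rule, or for manual review, without changing the caller.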

Generating Bibliographic Records

Once the bibliographic elements are identified, generate bibliographic records in:
- ISO 2709
- Dublin Core
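As an illustration of the Dublin Core output, the following minimal sketch serializes identified elements as XML in the Dublin Core element namespace. The wrapper element name `metadata`, the function name `to_dublin_core`, and the sample values are assumptions made for the example, not part of the original work.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


def to_dublin_core(elements: dict) -> str:
    """Serialize {dc_element: value or list of values} as a simple XML record."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in elements.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = v
    return ET.tostring(root, encoding="unicode")


record = to_dublin_core({
    "title": "Compilers",
    "creator": ["First Author", "Second Author"],  # repeatable element
    "publisher": "Example Press",
    "date": "2007",
})
```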

Sample Heuristics for Identifying Title

Order of the bibliographic elements

Titles are found in the upper or upper-middle portion of the title page.

The title appears first on the title page (75.15 per cent). In a few cases the author or series occupies the first position.

Fonts used in the title field are the largest on the page (94.99 per cent) compared with the size of fonts in other fields.
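The largest-font heuristic might look like the sketch below when applied to the HTML produced by the OCR step. It assumes, purely for illustration, that font sizes appear as inline `style="font-size:..."` attributes; real OCR output may encode sizes differently. The names `FontRunCollector` and `guess_title` are hypothetical.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}  # tags with no closing tag


class FontRunCollector(HTMLParser):
    """Collect (font_size, text) runs from inline font-size styles."""

    def __init__(self):
        super().__init__()
        self.stack = [0.0]  # font size in effect, innermost element last
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        m = re.search(r"font-size\s*:\s*(\d+(?:\.\d+)?)", style)
        # Inherit the enclosing size when the element sets none.
        self.stack.append(float(m.group(1)) if m else self.stack[-1])

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.runs.append((self.stack[-1], text))


def guess_title(html: str) -> str:
    """Return the text set in the largest font, joined in document order."""
    parser = FontRunCollector()
    parser.feed(html)
    if not parser.runs:
        return ""
    largest = max(size for size, _ in parser.runs)
    return " ".join(text for size, text in parser.runs if size == largest)


page = (
    '<html><body>'
    '<p style="font-size:12pt">A Sample Series</p>'
    '<p style="font-size:28pt">Compiler Design</p>'
    '<p style="font-size:14pt">A. Author</p>'
    '</body></html>'
)
title = guess_title(page)
```

Since the slide reports that this rule holds in 94.99 per cent of cases rather than all of them, a fuller implementation would presumably combine it with the position and cue-word heuristics instead of relying on font size alone.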

If the title and subtitle occur on the same line, they are separated by “:” (colon) or “-” (hyphen).
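A minimal sketch of this separator rule follows. One detail is an assumption not stated in the slides: a hyphen is treated as a separator only when it has whitespace on both sides, so that hyphenated words such as "E-books" are left intact. The function name `split_title` is hypothetical.

```python
import re

# A colon, or a hyphen/dash with spaces on both sides (assumed rule).
SEPARATOR = re.compile(r"\s*:\s*|\s+[-\u2013\u2014]\s+")


def split_title(line: str) -> tuple[str, str]:
    """Split a title line into (title, subtitle) at the first separator."""
    m = SEPARATOR.search(line)
    if not m:
        return line.strip(), ""
    return line[:m.start()].strip(), line[m.end():].strip()


parts = split_title("Compilers: Principles, Techniques, and Tools")
```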

A title need not consist only of alphabetic characters. The title string may contain numerals and punctuation marks such as commas, hyphens and others.

Titles usually contain terms like “The”, “An”, “Introduction”, “Theory”, “in”, “to”.
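The cue-term observation can be turned into a simple score for ranking candidate lines. The sketch below uses only the six terms listed above; how the original system weighted or combined such cues is not stated, so the plain count used here is an assumption, as are the names `cue_score` and `rank_candidates`.

```python
CUE_WORDS = {"the", "an", "introduction", "theory", "in", "to"}


def cue_score(line: str) -> int:
    """Count how many words in the line are title cue terms."""
    words = [w.strip(".,:;()\"'").lower() for w in line.split()]
    return sum(1 for w in words if w in CUE_WORDS)


def rank_candidates(lines):
    """Order candidate lines from most to least title-like by cue count."""
    return sorted(lines, key=cue_score, reverse=True)


candidates = ["Oxford University Press", "An Introduction to the Theory of Numbers"]
best = rank_candidates(candidates)[0]
```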

Heuristics for other elements

- Subtitles
- Edition
- Volume
- Authors/Contributors
- Publisher
- Place
- Year
- Series

Present Experiment

- E-Books (from sites like amazon.com)
- E-Journals (non-OAI-compliant)

Methodology

- Template-based identification
- Heuristic-based identification

Disadvantages of Template Based Approach

For every new site, templates have to be created

A site may change its appearance, requiring you to develop more than one template for each site or journal
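Both drawbacks show up in even a toy version of the template-based approach. In the sketch below the host names and markup patterns are invented for illustration; each site needs its own hand-written pattern, and a pattern silently stops matching once the site changes its markup.

```python
import re

# Hypothetical templates: one hand-written pattern per site.
TEMPLATES = {
    "journal-a.example": re.compile(r'<h1 class="article-title">(.*?)</h1>', re.S),
    "journal-b.example": re.compile(r'<span id="ttl">(.*?)</span>', re.S),
}


def title_from_template(host: str, html: str):
    """Extract a title with the site's template, or None if that fails."""
    pattern = TEMPLATES.get(host)
    if pattern is None:
        return None  # new site: no template exists yet
    m = pattern.search(html)
    # None here usually means the site's markup no longer fits the template.
    return m.group(1).strip() if m else None


found = title_from_template(
    "journal-a.example", '<h1 class="article-title"> Graph Theory </h1>'
)
```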

Methodology

- Study a few sites to develop heuristics
- Web crawler to probe the site
- Identify the files having documents (filter irrelevant files)
- Apply heuristics on the files having e-documents
- Generate Dublin Core records
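The filtering step in the list above could be sketched as follows. The slides do not say how relevant files are recognised, so the extension-based rule and the particular extensions are assumptions; the URLs are placeholders.

```python
import posixpath
from urllib.parse import urlparse

# Assumed set of extensions likely to hold e-documents.
DOCUMENT_EXTS = {".pdf", ".html", ".htm"}


def is_candidate(url: str) -> bool:
    """True if the URL path ends in an extension that may hold a document."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext in DOCUMENT_EXTS


def filter_documents(urls):
    """Keep only the crawled URLs worth passing to the heuristics."""
    return [u for u in urls if is_candidate(u)]


crawled = [
    "http://journal.example/vol1/paper1.pdf",
    "http://journal.example/style.css",
    "http://journal.example/logo.PNG",
    "http://journal.example/vol1/abs1.html?id=3",
]
kept = filter_documents(crawled)
```

A crawler that checks the response content type instead of the file extension would be more robust, at the cost of one request per file.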

Thank You

Welcome to the International Conference on Semantic Web & Digital Libraries

21st – 23rd February, 2007
Indian Statistical Institute, Bangalore
