Heuristic Approach for Automatic Metadata Capture of E-books/Journals ARD Prasad DRTC Indian Statistical Institute Bangalore

Heuristic Approach for Automatic Metadata Capture of E-books/Journals

ARD Prasad
DRTC, Indian Statistical Institute, Bangalore

Agenda

- Earlier Experiment with printed books
- Present Experiment with E-Books & E-Journals

Heuristics for Printed Books

Heuristics for the ...
- Title page
- Verso of the title page

Methodology for Printed Books

- Scan the title page
- OCR the image
- Generate the output in HTML
- Apply heuristics to the HTML pages
- Identify the bibliographic elements

Heuristics for Verso of the Title Page

- Identify date, edition, etc.
- See whether prenatal cataloging (cataloging-in-publication data) is available
- Identify the bibliographic elements in the prenatal catalog entry
- Cross-check them against the identifications from the title page
- Resolve any conflicts
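
The slides do not say how conflicts are resolved. As a minimal sketch of the cross-check step, the function below compares the elements found on the title page with those found in the prenatal catalog entry. The field names and the policy of preferring the catalog entry on a conflict are assumptions made only for illustration.

```python
def reconcile(title_page, cip):
    """Merge two dicts of bibliographic elements and list conflicting fields.

    Assumed policy (not from the slides): when both sources give a value,
    the prenatal catalog (CIP) value wins, and the field is reported as a
    conflict if the two values differ after trimming and case folding.
    """
    merged, conflicts = {}, []
    for field in sorted(set(title_page) | set(cip)):
        from_title = title_page.get(field)
        from_cip = cip.get(field)
        if from_title and from_cip:
            if from_title.strip().lower() != from_cip.strip().lower():
                conflicts.append(field)
        merged[field] = from_cip or from_title
    return merged, conflicts


merged, conflicts = reconcile(
    {"title": "Theory of Computation", "year": "2004"},
    {"title": "theory of computation", "year": "2005", "edition": "2nd ed."},
)
```

Fields listed in `conflicts` could be routed to a human cataloguer instead of being settled automatically.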

Generating Bibliographic Records

Once the bibliographic elements are identified, generate bibliographic records in:
- ISO 2709
- Dublin Core
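
As an illustration of the Dublin Core output, the sketch below serializes identified elements as XML using the Dublin Core element namespace. The `record` wrapper element, the function name, and the sample values are hypothetical; the slides do not describe the actual record layout.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(elements):
    """Serialize a dict of Dublin Core element names to an XML string.

    Each value may be a single string or a list of strings (for example,
    several creators). The wrapper element name "record" is an assumption.
    """
    record = ET.Element("record")
    for name, values in elements.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(record, encoding="unicode")


xml_text = to_dublin_core({
    "title": "Theory of Computation",        # sample data, not from the slides
    "creator": ["A. Author", "B. Author"],
    "publisher": "Example Press",
    "date": "2005",
})
```

Mapping authors to `creator` and year to `date` follows the usual Dublin Core element names.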

Sample Heuristics for Identifying Title

- Order of the bibliographic elements
- Titles are found in the upper or upper-middle portion of the title page.
- The title appears first on the title page (75.15 per cent). (In a few cases the author or series occupies the first position.)
- The fonts used in the title field are the largest (94.99 per cent) compared with the fonts in other fields.
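
The largest-font heuristic can be sketched as follows. This assumes the OCR step writes HTML in which font sizes appear as inline `style="font-size:..."` attributes. That markup convention, and the names `FontRunParser` and `guess_title`, are illustrative assumptions and not the author's implementation.

```python
import re
from html.parser import HTMLParser

FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)")
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}  # tags with no end tag


class FontRunParser(HTMLParser):
    """Collect (font_size, text) runs in page order."""

    def __init__(self):
        super().__init__()
        self.size_stack = [0.0]  # sizes inherited from enclosing tags
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        match = FONT_SIZE.search(style)
        size = float(match.group(1)) if match else self.size_stack[-1]
        self.size_stack.append(size)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.size_stack) > 1:
            self.size_stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.runs.append((self.size_stack[-1], text))


def guess_title(html):
    """Return the text set in the largest font, or None if the page is empty."""
    parser = FontRunParser()
    parser.feed(html)
    parser.close()
    if not parser.runs:
        return None
    largest = max(size for size, _ in parser.runs)
    return " ".join(text for size, text in parser.runs if size == largest)


page = (
    '<div><p style="font-size:12pt">Series in Computing</p>'
    '<p style="font-size:28pt">Theory of<br>Computation</p>'
    '<p style="font-size:14pt">A. N. Author</p></div>'
)
title = guess_title(page)
```

Because the runs are kept in page order, the position heuristic (title first, upper part of the page) could be added as a tie-breaker when several fields share the largest font.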

- If the title and sub-title occur on the same line, they are separated by “:” (colon) or “-” (hyphen).
- A title need not consist only of alphabetic characters. The title string may contain numerals and punctuation marks such as commas and hyphens.
- Titles usually contain terms like “The”, “An”, “Introduction”, “Theory”, “in”, “to”.
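
A minimal sketch of the separator heuristic is given below. Requiring spaces around the hyphen, so that hyphenated words such as "E-books" stay intact, is an added assumption, as is the function name `split_title`.

```python
import re

# A colon, or a hyphen with whitespace on both sides (assumed rule so that
# hyphenated words inside the title are not treated as separators).
SEPARATOR = re.compile(r"\s*:\s*|\s+-\s+")


def split_title(line):
    """Split a title line into (title, subtitle); subtitle is None if absent."""
    parts = SEPARATOR.split(line.strip(), maxsplit=1)
    title = parts[0].strip()
    subtitle = parts[1].strip() if len(parts) == 2 else None
    return title, subtitle


examples = [
    split_title("Digital Libraries: Principles and Practice"),
    split_title("E-books - A Practical Guide"),
    split_title("Theory of Computation"),
]
```

Only the first separator is used, so a subtitle that itself contains a colon is kept whole.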

Heuristics for other elements

- Sub-titles
- Edition
- Volume
- Authors/Contributors
- Publisher
- Place
- Year
- Series

Present Experiment

- E-Books (from sites like amazon.com)
- E-Journals (non-OAI-compliant)

Methodology

- Template-based identification
- Heuristic-based identification

Disadvantages of Template Based Approach

- For every new site, templates have to be created.
- A site may change its appearance, so more than one template may be needed for each site or journal.

Methodology

- Study a few sites to develop heuristics
- Use a web crawler to probe the site
- Identify the files having documents (filter out irrelevant files)
- Apply heuristics to the files having e-documents
- Generate Dublin Core records
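
The filtering step could look like the sketch below, which keeps only crawled URLs whose path ends in a document-like extension. The extension list, the function name, and the example URLs are assumptions. The slides do not state how irrelevant files are recognized.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Assumed set of extensions that may hold e-documents.
DOCUMENT_EXTENSIONS = {".pdf", ".ps", ".html", ".htm"}


def likely_documents(urls):
    """Keep URLs whose path extension suggests an e-document."""
    keep = []
    for url in urls:
        extension = splitext(urlparse(url).path)[1].lower()
        if extension in DOCUMENT_EXTENSIONS:
            keep.append(url)
    return keep


crawled = [
    "http://example.org/vol1/article1.pdf",
    "http://example.org/style.css",
    "http://example.org/logo.GIF",
    "http://example.org/vol1/abstract2.HTML?view=full",
]
candidates = likely_documents(crawled)
```

The surviving URLs are the ones on which the metadata heuristics would then be run.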

Thank You

Welcome to International Conference on Semantic Web & Digital Libraries

21st – 23rd February, 2007
Indian Statistical Institute, Bangalore