developing digital libraries: general strategies sukhdev singh based on “from paper to...

37
Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten. Human Info NGO, Belgium; Simple Words, Romania; University of waikato, New

Upload: noah-wilkinson

Post on 12-Jan-2016

217 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Developing Digital Libraries:

General Strategies

Sukhdev SinghBased on “From Paper to Collection” by Dr.

Michel Loots, Dan Camarzan and Ian H Witten. Human Info NGO, Belgium; Simple

Words, Romania; University of waikato, New Zealand.

Page 2: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

INTRODUCTION

– Typical steps that have to be implemented are:i. Selecting the documents to be includedii. Securing copyrights permissions to use these documents in thedigital libraryiii. Scanning and OCR of the hard-copy documents which are notavailable in to digital form to have a perfect digital formativ. Converting all documents to a format (integrating text andimages).

Page 3: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

– v. Tagging the chapters, paragraphs and images of the digitaldocumentsvi. Organising the collection into a optimally structured digitallibraryvii. Building the digital library using the Greenstone softwareviii. Printing and distributing the collection on CD-ROM and/ordistributing it over the Internet

Page 4: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Answers the following before jumping in:

• What is the goal of your collection?

• What is your target group?

• How big is it—local, regional, or globaly?

• How many documents are you making available?

• How many pages?

• How much graphics content?

Page 5: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

• Does the material split into parts that will be consulted by a limitedaudience and parts that need to be disseminated widely?

• Are the documents already available electronically?• If so, in which formats? (Note incidentally that PDF

files are notautomatically equivalent to digital full-text form, as they oftencontain only page images.)

• What is the copyright status of the documents?

Page 6: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

• Who owns the copyright?• Are there other organizations with the same target

audience?• Are you willing to collaborate with other groups?• What budget is available for the whole project?• What human resources are available (in person-

months) for co-ordination,editing, scanning and programming?

• How many computers are available for this project?

Page 7: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

• How many CD-ROMs do you want to distribute?

• Will they be free, or for sale?

Page 8: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

SCANNERS AND SCANNING

• The first step in converting paper documents into a digital librarycollection is to obtain images of all pages of all publications in digitalformat.

Page 9: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Scanners

– Scanners are available in all price ranges, and all shapes and sizes. Theyrange from Rs. 5000 for flat-bed scanners to upwards of Rs. 25,00,000 for large industrial scanners from manufacturers such as Bell & Howell.

Page 10: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

– The output format of a scanned page is a computer file that is usually stored in TIFF or Bitmap format. Compressed TIFF IV is the best format to use. An average page scanned and converted to this format occupies only 50 Kb, compared to perhaps 2 Mb for the equivalent page in uncompressed Bitmap form.

Page 11: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Low-cost flat-bed scanner

– Low-cost flat-bed units are the cheapest and most widely available type of scanner. There are many brands: HP, Agfa, Acer, etc. Prices range from 5,000 to 15,000. Both black-and-white and color images can be scanned.The low price allows each computer to have its own scanner.

Page 12: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

– Fact is that rates exceeding twelve pages per hour are rarelyachieved.

Page 13: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

– these scanners are useful only for small jobs with limited numbers of pages—no more than 200 to 400 pages a month on a regularbasis, or one-time jobs of up to 1000 or 2000 pages.

Page 14: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Low-end scanner with sheet feeder

– Low-end scanners with sheet feeders typically cost between Rs. 25,000 and Rs. 60,000. Ten to fifty pages can be inserted, scanned and processed at once: thus the operator does not have to attend constantly to the machine. This increases capacity up to 150 to 200 pages per day. These scanners are more robust, and have a larger lifespan before repair—usually in therange 30,000 to 50,000 pages.

Page 15: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

–A disadvantage is that only one side of the page is scanned at a time—the stack of pages must be reversed and rescanned in order to obtain an image of both sides. This often creates problems because sheet feeders are never without problems and sometimes pages get blocked. These scanners are useful for up to 1500 to 3000 pages a month.

Page 16: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Color scanners

– Any scanning operation invariably involves some color images, so a color scanner will always be required. Generally speaking, less than 5% of any publication contains color images, plus the cover. Thus a low cost flat-bed scanner as described above suffices. It is advisable to select one capable of scanning up to 600 dpi resolution.

Page 17: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Professional duplex scanners

– Professional scanners are reliable, heavy-duty machines capable of processing a large volume of pages—typically from 2000 pages to 10,000 pages per day. They have an automatic sheet-feeder tray system that processes batches of about 50 to 200 pages. The best and fastest are duplex machines that scan both sides of the page at once.

Page 18: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

– Professional duplex scanners require a powerful computer with a hard disk of at least 10 to 20 Gb. Prices range from Rs. 2,50,000 to Rs. 25,00,000. For example, the Canon DR-6020 duplex scanner costs Rs.2,50,000 and works with double-sided documents. It has a capacity of about 2000 pages per day and a lifespan of 600,000 to 800,000 pages. Bell & Howell and Fujitsu scanners range from Rs. 5,00,000 to Rs. 25,00,000 and have a lifespan of many millions of pages.

Page 19: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Scanning programs

– Every scanner comes with its own software, which means that the program must be installed on the computer that manages the scanner. Some have a computer card that needs to be installed in your computer to speed up the scanning operation.

Page 20: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Preparing the documents

– Before being scanned, documents must be properly prepared. Dusty documents must be cleaned, humid documents dried, clips removed, pages unfolded.The spine of each book should be removed by cutting it off, straight and precisely. Books provided by libraries must often be rebound, and if so you should be particularly careful when removing spines in order tofacilitate smooth rebinding.

Page 21: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

The scanning process

– Using software provided with the scanner, a digital image of each paper page is scanned and transformed into a Bitmap or TIFF image.

Page 22: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Quality control

– First divide the material into batches with similar paper and print qualities. Perform OCR tests on a sample from the first batch to determine the optimal settings. Then scan all material in this batch beforeproceeding to the next one.

Page 23: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Filename conventions

– Follow a suitable file naming convention.

Page 24: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Productivity and resources

– You should not underestimate the magnitude of the scanning operation— and particularly the OCR process that follows. It is best to consider scanning and OCR as completely separate activities. The optimal choicefrom an economic and practical point of view should be made individually for each one.

Page 25: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Scanning costs

– An important decision is whether to invest in scanning equipment and perform all scanning oneself, or outsource it to a scanning company. The main considerations are:

Page 26: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

pressure of time for the scanning job;total number of pages;salary costs of those who perform the scanning

Page 27: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

OCR: OPTICAL CHARACTER RECOGNITION

– The following steps are involved in converting paper documents tocomputer form:1. scanning;2. page layout analysis;3. recognition;4. scanning images and tables.

Page 28: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

– On the market are many good OCR programs, with prices ranging from Rs. 5000 to Rs.20,000.

– For example, among many others are:•Read-Iris (http://www.readiris.com/)•Omnipage (http://www.omnipage.com/)•Fine-Reader (http://www.finereader.com/)

Page 29: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

– A choice must be made between undertaking the scanning and OCR in-house or outsourcing it to a commercial organization. To do it in-houserequires a scanner, OCR software program, OCR skill development, and a quality-conscious, highly motivated workforce.

Page 30: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

The OCR process

– The OCR process differs from one OCR program to another, and each one requires a considerable amount of learning. The program’s manual will explain this process in detail. Four points deserve particular attention: quality control, tables, images, and specialized material such as formulas, foreign characters etc.

Page 31: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Quality control

– The first is performed at the same time as OCR. Every OCR program hasa built-in spell-checker that highlights every suspect letter.

– The second is a general check of the text once the OCR process isfinished.

Page 32: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

– The third is a spelling check using Microsoft Word.

– Finally, the completed document should be checked by an independent person who samples the complete book and checks for errors, problems with tables and images, tagging, and the general look of the resulting text.

Page 33: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

• Tables

• Images

• Specialized material

Page 34: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Productivity and resources

•Young people between 18 and 25 are best suited for this job.•Because the first hours are always the most productive, the work should either be organized on a part-time basis or only the most motivated and concentrated people should be selected for full-time work.

Page 35: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

•Two-thirds of people tend to quit or get bored after about three to five weeks. This translates into poorer quality and low productivity in the last weeks.•A regular supply of work is needed to justify the necessary training, to maintain concentration, and to keep spirits high.

28 Pages / Day

Page 36: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

Alternatives to OCR

– Manual retyping– Image files– Combining scanning and OCR

Page 37: Developing Digital Libraries: General Strategies Sukhdev Singh Based on “From Paper to Collection” by Dr. Michel Loots, Dan Camarzan and Ian H Witten

CREATING AN ELECTRONIC COLLECTION

– Meta Data– Tagging document files