Chapter Two. Preliminaries: Sorting out the ingredients
How to Build a Digital Library
Ian H. Witten and David Bainbridge

Upload: egbert-lindsey

Post on 18-Jan-2016


TRANSCRIPT

Page 1: Chapter Two Preliminaries: Sorting out the ingredients How to Build a Digital Library Ian H. Witten and David Bainbridge

Chapter Two
Preliminaries: Sorting out the ingredients
How to Build a Digital Library
Ian H. Witten and David Bainbridge

Page 2: Planning a Digital Library

Responsibilities
Technology to be used
  Greenstone software
Metadata
  Summary information
Types of access
Digitizing documents
  Majority of the work

Page 3: Responsibilities

Legal issues
  Distributing information carries responsibilities
  Copyright
Social issues
  Respect customs of the community
  Both source and use communities
Ethical issues

Page 4: Fundamental Questions

What is the purpose of the library?
What are the principles for including documents?
When does one document differ from another?

Page 5: Sources of Material

Existing library to be converted to digital form
An existing collection of material to be made available as a digital library
Material already existing on the Web to be organized and presented via a portal

Page 6: Sources of Material

Ideology
Converting an existing library
Building a new collection
Virtual libraries

Page 7: Ideology

Ideology: a clear conception of what you plan to achieve with the collection of information
Ideology of a collection:
  Purpose
  Objectives
  Principles
    Guide what is to be included in the collection

Page 8: Introduction to Digital Library

State the purpose of the collection
Describe how the collection is organized

Page 9: Document versus Work

Work
  The disembodied content of a message
  Pure information
Document
  Traditional library: a physical object that embodies the work
  Digital library: a particular electronic encoding of a work
How are distinctions made between different manifestations of a single work?
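
One way to picture the work/document distinction is as a one-to-many relationship. The sketch below is illustrative only: the class names `Work` and `Document`, their fields, and the sample values are assumptions made for this example, not something the slides define.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One particular electronic encoding of a work (hypothetical fields)."""
    encoding: str  # e.g. "PDF" or "HTML"
    version: str   # an edition, release or revision label


@dataclass
class Work:
    """The disembodied content; it may have any number of manifestations."""
    title: str
    documents: list[Document] = field(default_factory=list)

    def add(self, doc: Document) -> None:
        self.documents.append(doc)


# Made-up sample data: one work embodied in two documents.
work = Work("Example Work")
work.add(Document("PDF", "revision 1"))
work.add(Document("HTML", "revision 1"))

encodings = [d.encoding for d in work.documents]
```

Keeping the manifestations attached to a single `Work` lets a catalogue list the work once while still offering each encoding.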

Page 10: Converting an Existing Library

Digitizing an existing paper-based collection is the most expensive kind of project
Consider whether it is worth the effort and expense

Page 11: Advantages of Digital Libraries

Easier to access remotely than conventional libraries
Powerful search and browsing
Easier to add additional services

Page 12: Questions

Will the digital library coexist with an existing physical one?
What is the collection's growth rate?
How dynamic is the collection?
Should you consider outsourcing the whole digital library operation?
Could user needs be satisfied in alternative ways?

Page 13: Prioritizing Materials

Special collections and unique materials
  Rare books and manuscripts
High-use items
  Research and teaching materials
Low-use items

Page 14: Criteria for Digital Conversion

Intellectual content
  Scholarly value
  Desire to enhance access to information
  Funding available
Educational value
  Classroom support
  Background reading
  Distance education
Institutional
  Resource sharing
  Promote strengths of an institution
Reduce handling of fragile originals
Cost and space savings
Copyright

Page 15: Principles for Development

Utility
Local imperative
Novelty
Intertextuality
Resources
Commitment to the transition

Page 16: Building a New Collection

New material
  The copyright holder may be the best one to create a digital collection
Metadata
  Where will it come from?

Page 17: Virtual Libraries

A portal to information that is in electronic form but located elsewhere on the Internet
Source information is already available
Some metadata is available

Page 18: Virtual Libraries

Select the content
  Define a purpose or theme for the library
  Seek and filter information
    Focused Web crawling
Obtain additional metadata
  Aids in the organization of the collection
  The higher the educational value of a resource, the more time should be taken in generating its description
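
The "seek and filter" step of a focused crawl can be sketched as a relevance test applied to each fetched page. This is a minimal illustration, not the method the book prescribes: the scoring rule (share of theme terms present), the function names and the 0.5 threshold are all assumptions.

```python
def relevance(text: str, theme_terms: set[str]) -> float:
    """Fraction of the theme terms that occur in the page text."""
    words = set(text.lower().split())
    return len(theme_terms & words) / len(theme_terms)


def should_follow(text: str, theme_terms: set[str], threshold: float = 0.5) -> bool:
    """Keep a page, and follow its links, only if it looks on-theme."""
    return relevance(text, theme_terms) >= threshold


# Hypothetical theme for the virtual library.
theme = {"digital", "library", "metadata"}

on_topic = should_follow("A digital library needs good metadata", theme)
off_topic = should_follow("Cheap flights and hotels", theme)
```

A real crawler would fetch pages over the network and use a stronger classifier; the point here is only that the library's stated theme drives what gets collected.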

Page 19: Generating Metadata in a Virtual Library

Automatically generated
  URL
  Author-supplied metadata
  Keyword extraction
Manual review
  Edit and enrich the automatically generated metadata
Intensive description by a human expert
  Provides extensive metadata
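
The automatically generated tier can be sketched with nothing more than the URL and a word count. The code below is a crude stand-in for real keyword extraction: the stopword list, the function names and the sample page are assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# A deliberately tiny stopword list (assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def extract_keywords(text: str, top_n: int = 3) -> list[str]:
    """Return the most frequent non-stopword terms in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def automatic_metadata(url: str, text: str) -> dict:
    """Build a first-pass record from what can be derived without a human."""
    return {"url": url, "host": urlparse(url).netloc, "keywords": extract_keywords(text)}


sample_text = "Scanning produces an image of each page. Scanning resolution affects image quality."
record = automatic_metadata("http://example.org/scanning.html", sample_text)
```

A record like this is the raw material that the manual review tier would then edit and enrich.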

Page 20: Bibliographic Organization

Objectives of a bibliographic system
Bibliographic entities

Page 21: Original Objectives of a Bibliographic System

Finding
  User seeks a known document when information such as author, title or subject is known
Collocation
  “To place together or in proper order”
  Locating similar information by subject matter, author, etc.
Choice
  User must choose between similar documents
    Bibliographically in terms of edition
    Topically in terms of character
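
Finding and collocation map naturally onto two small operations over catalogue records. The sketch below uses made-up records and hypothetical function names (`find`, `collocate`); it illustrates the two objectives and is not a real catalogue API.

```python
from collections import defaultdict

# Made-up catalogue records for illustration only.
records = [
    {"title": "Alpha", "author": "Smith", "subject": "maps"},
    {"title": "Beta", "author": "Jones", "subject": "maps"},
    {"title": "Gamma", "author": "Smith", "subject": "music"},
]


def find(records, **known):
    """Finding: return the records that match every known attribute."""
    return [r for r in records if all(r.get(k) == v for k, v in known.items())]


def collocate(records, attribute):
    """Collocation: place together the titles that share an attribute value."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attribute]].append(r["title"])
    return dict(groups)


found = find(records, author="Smith", title="Gamma")
by_subject = collocate(records, "subject")
```

The choice objective then applies to whatever `collocate` groups together: the user picks among similar items by edition or by character.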

Page 22: Current Objectives of a Bibliographic System

Locate
  Find entities in a file or database as the result of a search using attributes or relationships of the entities
Identify
  Confirm entity described in a record is the one sought
Select
  Verify that entity is what the user needs
Acquire
  Obtain access through purchase, loan or online access
Navigate
  Go through a bibliographic database
  Find works related by generalization, association, aggregation
  Find attributes related by equivalence, association and hierarchy

Page 23: Documents in Digital Libraries

Document
  A particular electronic encoding of a work
  Can be easily duplicated
  Uncertain boundaries
Digital libraries should present users with an image of stability and continuity
  As though electronic documents were identifiable, discrete objects like physical ones

Page 24: Bibliographic Entities

Documents
Works
  Distinction between document and work
Editions
  Electronic documents use terms such as version, release and revision
Authors
  Authority control: standardized names for authors
Titles
  Attributes of works
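
Authority control amounts to mapping every variant of a name onto one standard heading. The lookup table below is a hypothetical authority file with invented variant forms; a real system would consult a maintained name authority rather than a hard-coded dictionary.

```python
# Hypothetical authority file: variant name forms mapped to one standard heading.
AUTHORITY = {
    "ian witten": "Witten, Ian H.",
    "i. h. witten": "Witten, Ian H.",
    "witten, ian h.": "Witten, Ian H.",
}


def authorized_name(raw: str) -> str:
    """Return the standardized heading, or the tidied input if no entry exists."""
    tidy = " ".join(raw.split())          # collapse stray whitespace
    return AUTHORITY.get(tidy.lower(), tidy)
```

For example, `authorized_name("  Ian  Witten ")` and `authorized_name("I. H. Witten")` resolve to the same heading, so works by one author collocate under one name.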

Page 25: Bibliographic Entities

Subjects
  Two approaches to automatically assign subject:
    Key-phrase extraction
    Key-phrase assignment
  Literary and artistic works
    Style, form, content, genre
  Library of Congress Subject Headings (LCSH)
    Controlled vocabularies: 30,000 pages, 2,000,000 entries
    Hierarchical relationship of broader and narrower topics
Subject classifications
  Traditional libraries have a linear arrangement
  Digital collection can be rearranged at the click of a mouse
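
The broader/narrower relationship and the idea of rearranging a collection on demand can both be shown in a few lines. The term links and the sample items below are invented for illustration and have not been checked against the actual LCSH; the function names are assumptions as well.

```python
# Hypothetical broader-term links (not verified LCSH entries).
BROADER = {
    "Digital libraries": "Libraries",
    "Libraries": "Information services",
}


def broader_chain(term: str) -> list[str]:
    """Walk from a term up through successively broader topics."""
    chain = [term]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


def narrower_terms(term: str) -> list[str]:
    """List the terms whose immediate broader topic is the given term."""
    return sorted(t for t, b in BROADER.items() if b == term)


# A shelf has one linear order; a digital collection can be re-sorted at will.
items = [("Zebra atlas", 1999), ("Ant guide", 2003)]  # made-up (title, year) pairs
by_title = sorted(items)
by_year = sorted(items, key=lambda item: item[1])
```

The same records yield a different first item under each ordering, which is exactly what a physical shelf arrangement cannot offer.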

Page 26: Modes of Access

Web
Terminal in physical library
Standalone computer with CD-ROM or DVD
Distributed system
Restricting access
  Firewalls
  Password protection
  Watermarking

Page 27: Digitizing Documents

Digitization
  The process of taking traditional library materials and converting them to electronic form
  Allows storage and manipulation by a computer
  The process is time-consuming and expensive

Page 28: Stages of Digitization

Scanning
  Creates a digitized image of each page
  Usually presented to the user
Optical Character Recognition (OCR)
  Creates a digital representation of the textual content of the pages
  Necessary for full-text indexing
  Allows searching

Page 29: Digitizing Documents

Scanning
Optical character recognition
Interactive OCR
Page handling
Planning an image digitization project
Inside an OCR shop
An example project

Page 30: Scanning

Produces a digitized image of each page
Resembles a digitized photograph

Page 31: Decisions in Scanning

Black-and-white, grayscale or color
Resolution
  Number of pixels per linear unit
Bits per pixel
  Monochrome display: 16 or 256 levels of gray
  Color display: up to 24 or 32 bits per pixel
Quality
  Higher quality increases storage space and access time
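
The trade-off between quality and storage follows directly from resolution and bits per pixel: an uncompressed image needs (width in pixels) × (height in pixels) × (bits per pixel) / 8 bytes. The sketch below works this out for a page size and resolution chosen purely for illustration (8.5 × 11 inches at 300 dots per inch); the function name is an assumption.

```python
def image_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed page: 8.5 x 11 inches scanned at 300 dpi.
sizes = {
    "black-and-white (1 bit)": image_bytes(8.5, 11, 300, 1),
    "256 grays (8 bits)": image_bytes(8.5, 11, 300, 8),
    "color (24 bits)": image_bytes(8.5, 11, 300, 24),
}

for label, size in sizes.items():
    print(f"{label}: {size:,} bytes")
```

Moving from black-and-white to 24-bit color multiplies the uncompressed size by 24, and doubling the resolution multiplies it by four, before any compression is applied.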

Page 32: Optical Character Recognition

Produces a character-by-character representation of the document
Transforms the scanned image into a digitized representation of the page content
Manual cleanup is necessary
Less efficient than manual keying when accuracy drops below 95 percent
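
Reading the 95 percent figure as character-level accuracy, a quick calculation shows why cleanup becomes so costly near that threshold. The page length of 2,000 characters is an assumption chosen for round numbers, and the function name is hypothetical.

```python
def expected_errors(chars_per_page: int, accuracy_percent: int) -> int:
    """Average number of characters per page that need manual correction."""
    return chars_per_page * (100 - accuracy_percent) // 100


# Assumed page of 2,000 characters.
at_99 = expected_errors(2000, 99)
at_95 = expected_errors(2000, 95)
```

Under these assumptions a page recognized at 95 percent accuracy carries five times as many corrections as one at 99 percent, which is the kind of workload that can make retyping the page competitive.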

Page 33: Interactive OCR

Optical character recognition should be done as an interactive process
Acquisition
  Input from scanner or read a file
Cleanup
  Filtering, deskewing and manual cleanup of unwanted areas
Page analysis
  Examine layout
Recognition
  The “OCR” part
Checking
Saving
  Plain text, HTML, RTF, PDF, MS Word
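
The six stages can be pictured as a pipeline that each page passes through in order. The sketch below is only a skeleton: the handlers are stand-ins that attach placeholder strings, and the names `STAGES`, `run_page` and `handlers` are assumptions. A real workflow would call a scanner driver and an OCR engine, and would pause for an operator at the cleanup and checking stages.

```python
from typing import Callable

STAGES = ("acquisition", "cleanup", "page analysis", "recognition", "checking", "saving")


def run_page(page: dict, handlers: dict[str, Callable[[dict], dict]]) -> dict:
    """Apply each stage in order; a stage without a handler passes the page through."""
    for stage in STAGES:
        handler = handlers.get(stage, lambda p: p)
        page = handler(page)
        page.setdefault("log", []).append(stage)  # record that the stage ran
    return page


# Stand-in handlers that only attach placeholder values.
handlers = {
    "acquisition": lambda p: {**p, "image": "raw scan"},
    "recognition": lambda p: {**p, "text": "recognized text"},
}

result = run_page({"id": 1}, handlers)
```

Keeping the stages as named steps makes it easy to insert a human decision between any two of them, which is what makes the process interactive.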

Page 34: Page Handling

Unbinding
Microfiche or microfilm
Two most expensive parts
  Handling the paper
  OCR

Page 35: Planning a Digitization Project

Outsourcing
Cost
  $1 to $2 for scanning and OCR
Quality control
Verification
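
For budgeting, the quoted range can be turned into a rough estimate. The slide does not state the unit, so the sketch below assumes the $1 to $2 figure is a cost per page; if the figure refers to something else, the arithmetic does not apply. The function name and the 10,000-page collection are likewise assumptions.

```python
def cost_range(pages: int, low: float = 1.0, high: float = 2.0) -> tuple[float, float]:
    """Lower and upper cost estimate, assuming the rates are per page."""
    return pages * low, pages * high


# Hypothetical collection of 10,000 pages.
low_estimate, high_estimate = cost_range(10_000)
```

Quality control and verification add to this figure, so an estimate of this kind is best treated as a floor.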