chapter two preliminaries: sorting out the ingredients how to build a digital library ian h. witten...
TRANSCRIPT
Chapter Two
Preliminaries: Sorting out the ingredients
How to Build a Digital Library
Ian H. Witten and David Bainbridge
Planning a Digital Library
  Responsibilities
  Technology to be used
    Greenstone software
  Metadata
    Summary information
  Types of access
  Digitizing documents
    Majority of the work
Responsibilities
  Legal issues
    Distributing information carries responsibilities
    Copyright
  Social issues
    Respect customs of the community
    Both source and use communities
  Ethical issues
Fundamental Questions
  What is the purpose of the library?
  What are the principles for including documents?
  When does one document differ from another?
Sources of Material
  Existing library to be converted to digital form
  An existing collection of material to be made available as a digital library
  Material already existing on the Web to be organized and presented via a portal
Sources of Material
  Ideology
  Converting an existing library
  Building a new collection
  Virtual libraries
Ideology
  Ideology: a clear conception of what you plan to achieve with the collection of information
  Ideology of a collection:
    Purpose
    Objectives
    Principles
      Guide what is to be included in the collection
Introduction to Digital Library
  State the purpose of the collection
  Describe how the collection is organized
Document versus Work
  Work
    The disembodied content of a message
    Pure information
  Document
    Traditional library: a physical object that embodies the work
    Digital library: a particular electronic encoding of a work
  How are distinctions made between different manifestations of a single work?
Converting an Existing Library
  Digitizing an existing paper-based collection is the most expensive kind of project
  Consider whether it is worth the effort and expense
Advantages of Digital Libraries
  Easier to access remotely than conventional libraries
  Powerful search and browsing
  Easier to add additional services
Questions
  Will the digital library coexist with an existing physical one?
  What is the collection’s growth rate?
  How dynamic is the collection?
  Should you consider outsourcing the whole digital library operation?
  Could user needs be satisfied in alternative ways?
Prioritizing Materials
  Special collections and unique materials
    Rare books and manuscripts
  High-use items
    Research and teaching materials
  Low-use items
Criteria for Digital Conversion
  Intellectual content
    Scholarly value
    Desire to enhance access to information
    Funding available
  Educational value
    Classroom support
    Background reading
    Distance education
  Institutional
    Resource sharing
    Promote strengths of an institution
  Reduce handling of fragile originals
  Cost and space savings
  Copyright
Principles for Development
  Utility
  Local imperative
  Novelty
  Intertextuality
  Resources
  Commitment to the transition
Building a New Collection
  New material
    The copyright holder may be the best one to create a digital collection
  Metadata
    Where will it come from?
Virtual Libraries
  A portal to information that is in electronic form but located elsewhere on the Internet
  Source information is already available
  Some metadata is available
Virtual Libraries
  Select the content
    Define a purpose or theme for the library
    Seek and filter information
    Focused Web crawling
  Obtain additional metadata
    Aids in the organization of the collection
    The higher the educational value of a resource, the more time should be taken in generating its description
Generating Metadata in a Virtual Library
  Automatically generated
    URL
    Author-supplied metadata
    Keyword extraction
  Manual review
    Edit and enrich the automatically generated metadata
  Intensive description by a human expert
    Provides extensive metadata
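The automatically generated level can be illustrated with a minimal sketch. It assumes the page text has already been fetched, treats simple word frequency as a stand-in for keyword extraction, and uses hypothetical names (`automatic_metadata`, `STOPWORDS`) and a placeholder URL that do not come from the slides. Real systems would use a far better extractor.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "from"}


def automatic_metadata(url, text, author_supplied=None, max_keywords=5):
    """Build a basic metadata record from a URL, page text and any
    author-supplied fields (for example, a title taken from the page)."""
    parsed = urlparse(url)
    # Naive keyword extraction: the most frequent non-stopword terms.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    keywords = [word for word, _ in counts.most_common(max_keywords)]
    record = {"url": url, "host": parsed.netloc, "keywords": keywords}
    # Author-supplied values are added last so they are never overwritten.
    record.update(author_supplied or {})
    return record


sample_text = (
    "Digital libraries organize digital documents. "
    "Metadata describes documents; metadata aids digital libraries."
)
record = automatic_metadata(
    "http://example.org/docs/page.html",
    sample_text,
    author_supplied={"title": "Sample"},
    max_keywords=4,
)
print(record)
```

A record like this is only a starting point; the manual review stage would edit and enrich it.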
Bibliographic Organization
  Objectives of a bibliographic system
  Bibliographic entities
Original Objectives of a Bibliographic System
  Finding
    User seeks a known document when information such as author, title or subject is known
  Collocation
    “To place together or in proper order”
    Locating similar information by subject matter, author, etc.
  Choice
    User must choose between similar documents
    Bibliographically in terms of edition
    Topically in terms of character
Current Objectives of a Bibliographic System
  Locate
    Find entities in a file or database as the result of a search using attributes or relationships of the entities
  Identify
    Confirm that the entity described in a record is the one sought
  Select
    Verify that the entity is what the user needs
  Acquire
    Obtain access through purchase, loan or online access
  Navigate
    Go through a bibliographic database
    Find works related by generalization, association, aggregation
    Find attributes related by equivalence, association and hierarchy
Documents in Digital Libraries
  Document
    A particular electronic encoding of a work
    Can be easily duplicated
    Uncertain boundaries
  Digital libraries should present users with an image of stability and continuity
    As though electronic documents were identifiable, discrete objects like physical ones
Bibliographic Entities
  Documents
  Works
    Distinction between document and work
  Editions
    Electronic documents use terms such as version, release and revision
  Authors
    Authority control: standardized names for authors
  Titles
    Attributes of works
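Authority control can be sketched as a lookup from name variants to one standardized form. The table below is a hypothetical illustration: the variant spellings and the function names are assumptions, and a real authority file would be far larger and maintained by catalogers.

```python
import unicodedata

# Hypothetical authority table: name variants (normalized) -> authorized form.
AUTHORITY = {
    "witten, ian h.": "Witten, Ian H.",
    "ian h. witten": "Witten, Ian H.",
    "i. h. witten": "Witten, Ian H.",
    "david bainbridge": "Bainbridge, David",
    "bainbridge, d.": "Bainbridge, David",
}


def normalize(name):
    """Fold case, Unicode compatibility forms and runs of whitespace."""
    folded = unicodedata.normalize("NFKC", name)
    return " ".join(folded.lower().split())


def authorized_name(name):
    """Return the standardized name, or None if the variant is unknown."""
    return AUTHORITY.get(normalize(name))


print(authorized_name("  Ian H.  Witten "))
print(authorized_name("Bainbridge, D."))
print(authorized_name("Unknown Person"))
```

Returning `None` for unknown variants, rather than guessing, leaves the decision to a human reviewer.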
Bibliographic Entities
  Subjects
    Two approaches to automatically assign subject:
      Key-phrase extraction
      Key-phrase assignment
    Literary and artistic works
      Style, form, content, genre
    Library of Congress Subject Headings (LCSH)
      Controlled vocabularies: 30,000 pages, 2,000,000 entries
      Hierarchical relationship of broader and narrower topics
  Subject classifications
    Traditional libraries have a linear arrangement
    Digital collection can be rearranged at the click of a mouse
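The broader and narrower relationship in a controlled vocabulary, and the idea of assigning headings from that vocabulary, can be sketched as follows. The miniature vocabulary is hypothetical and is not a verified excerpt of LCSH, and the substring matching is only a naive stand-in for key-phrase assignment, which in practice is usually treated as a classification problem.

```python
# Hypothetical miniature vocabulary: each heading maps to its broader heading
# (None marks a top-level heading). Not a verified excerpt of LCSH.
BROADER = {
    "Digital libraries": "Libraries",
    "Libraries": "Information services",
    "Optical character recognition": "Pattern recognition systems",
    "Information services": None,
    "Pattern recognition systems": None,
}


def broader_chain(heading):
    """List the successively broader headings above the given one."""
    chain = []
    current = BROADER.get(heading)
    while current is not None:
        chain.append(current)
        current = BROADER.get(current)
    return chain


def narrower_terms(heading):
    """List the headings whose immediate broader heading is the given one."""
    return sorted(h for h, b in BROADER.items() if b == heading)


def assign_headings(text):
    """Naive assignment: pick every vocabulary heading that occurs in the text."""
    lowered = text.lower()
    return [h for h in BROADER if h.lower() in lowered]


print(broader_chain("Digital libraries"))
print(narrower_terms("Libraries"))
print(assign_headings("Planning digital libraries requires optical character recognition."))
```

Because the headings come from a fixed list, assignment yields consistent terms across documents, whereas extraction takes phrases directly from each document's own text.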
Modes of Access
  Web
  Terminal in physical library
  Standalone computer with CD-ROM or DVD
  Distributed system
  Restricting access
    Firewalls
    Password protection
    Watermarking
Digitizing Documents
  Digitization
    The process of taking traditional library materials and converting them to electronic form
    Allows storage and manipulation by a computer
    The process is time-consuming and expensive
Stages of Digitization
  Scanning
    Creates a digitized image of each page
    Usually presented to the user
  Optical Character Recognition (OCR)
    Creates a digital representation of the textual content of the pages
    Necessary for full-text indexing
    Allows searching
Digitizing Documents
  Scanning
  Optical character recognition
  Interactive OCR
  Page handling
  Planning an image digitization project
  Inside an OCR shop
  An example project
Scanning
  Produces a digitized image of each page
  Resembles a digitized photograph
Decisions in Scanning
  Black-and-white, grayscale or color
  Resolution
    Number of pixels per linear unit
  Bits per pixel
    Monochrome display: 16 or 256 levels of gray
    Color display: up to 24 or 32 bits per pixel
  Quality
    Higher quality increases storage space and time to access
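The trade-off between quality and storage can be made concrete with a rough estimate of raw image size: pixels across times pixels down times bits per pixel. The sketch below assumes an uncompressed image and uses a US Letter page at 300 dpi as a hypothetical example; the function name and the figures are illustrative and do not come from the slides.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw (uncompressed) size of one scanned page in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical example: a US Letter page (8.5 x 11 inches) scanned at 300 dpi.
for label, bpp in [("black-and-white", 1), ("grayscale", 8), ("color", 24)]:
    size = uncompressed_scan_bytes(8.5, 11, 300, bpp)
    print(f"{label:16s} {bpp:2d} bits/pixel: {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the pixel count, so resolution usually dominates the storage cost; compression reduces these figures considerably in practice.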
Optical Character Recognition
  Produces a character-by-character representation of the document
  Transforms the scanned image into a digitized representation of the page content
  Manual cleanup is necessary
  Less efficient than manual keying when accuracy drops below 95 percent
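One way to check OCR output against the 95 percent figure is to key a small sample by hand and measure character accuracy with edit distance. The sketch below is an assumption about how such a check might be done, not a procedure described in the slides; the function names and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text, reference_text):
    """Fraction of reference characters recovered, clamped to [0, 1]."""
    if not reference_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, reference_text)
    return max(0.0, 1.0 - errors / len(reference_text))


def worth_correcting(ocr_text, reference_text, threshold=0.95):
    """True if accuracy on the sample meets the threshold for manual cleanup."""
    return character_accuracy(ocr_text, reference_text) >= threshold


sample_ocr = "digital 1ibrary"       # hypothetical OCR output with one error
sample_truth = "digital library"     # hand-keyed reference for the same text
print(f"{character_accuracy(sample_ocr, sample_truth):.3f}")
print(worth_correcting(sample_ocr, sample_truth))
```

A sample this short is only for illustration; a realistic check would use several full pages drawn from different parts of the collection.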
Interactive OCR
  Optical character recognition should be done as an interactive process
  Acquisition
    Input from a scanner or read from a file
  Cleanup
    Filtering, deskewing and manual cleanup of unwanted areas
  Page analysis
    Examine layout
  Recognition
    The “OCR” part
  Checking
  Saving
    Plain text, HTML, RTF, PDF, MS Word
Page Handling
  Unbinding
  Microfiche or microfilm
  Two most expensive parts
    Handling the paper
    OCR
Planning a Digitization Project
  Outsourcing
  Cost
    $1 to $2 for scanning and OCR
  Quality control
  Verification