Anatomy of Aggregate Collections: Exploring Mass Digitization and the “Collective Collection”
Brian Lavoie, Research Scientist, OCLC Research
NELINET, September 21, 2006
Road map
Aggregate collections
Aggregate collections as a tool for understanding mass digitization projects
• “Anatomy of aggregate collections: the example of Google Print for Libraries” (D-Lib Magazine, September 2005)
Digital preservation and mass digitization
Conclusion
The shrinking “width of the border”
Collection A Collection B
Distance Metrics
Economic
Technical
Physical
Aggregate collections
Definition: combined holdings of multiple institutions, viewed as a single collection
• 2 institutions, consortium, all libraries everywhere …
• WorldCat: aggregate collection of more than 70 million items, held by more than 25,000 institutions worldwide
Libraries embedded more deeply in networks of collaboration and coordination
• Decisions increasingly taken in context of inter-institutional environments, rather than local collection in isolation
• Shift in focus to resources of the “system”, rather than individual collections
As library networks develop and expand, opportunities arise to create value through collective action, or by aligning local collections with aspects of the system-wide environment
Anatomy of aggregate collections
Analysis of aggregate collections supports …
• Collaborative decision-making: direct collaboration by libraries (for example, collaborative storage strategies)
• “Decision-making in context”: local decision-making made in a larger context (for example, selecting print materials for digitization, given what has already been digitized elsewhere)
Better understanding of the anatomy of aggregate collections critical for wide range of library decision-making contexts:
• Collection management (cooperative collection development, shared off-site storage, collaborative preservation)
• Deeper resource sharing (meta-search, reducing frictions in resource sharing networks)
• Mass digitization
OCLC Research activities aimed at mobilizing library data (WorldCat) to understand and manage aggregate collections
Mass digitization and aggregate collections
Google Book Search (aka Google Print for Libraries)
Aggregate collection of digitized print books (combined holdings of Harvard, Michigan, Oxford, NYPL, and Stanford)
Focus on copyright issues; very little discussion of Google Book Search as aggregate collection
http://www.dlib.org/dlib/september05/lavoie/09lavoie.html
The system-wide print book collection as represented in WorldCat (January 2005)
[Bar chart]
Total WorldCat records: ~55 million
Language-based monographs: ~41 million
Language-based monographs, excluding government documents and theses/dissertations: ~35 million
Language-based monographs, excluding government documents and theses/dissertations, in print format only: ~32 million print books
More information: Schonfeld & Lavoie, “Books without Boundaries: A Brief Tour of the System-wide Print Book Collection”, Journal of Electronic Publishing, Vol. 9, No. 2, Summer 2006
http://www.hti.umich.edu/cgi/t/text/text-idx?c=jep;cc=jep;view=text;rgn=main;idno=3336451.0009.208
G5 coverage of system-wide print book collection
[Pie chart]
Held by at least one G5 library: 33% (10.5 million unique books)
Not held: 67%
Holdings overlap
[Pie chart: number of G5 libraries holding each book]
Held by 1: 61%
Held by 2: 20%
Held by 3: 10%
Held by 4: 6%
Held by 5: 3%
Potential redundancy rate of 40 percent
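The coverage and overlap figures above could be derived from holdings data along the following lines. This is a minimal sketch, not the method used in the study: the library names, book identifiers, and the function `overlap_profile` are hypothetical, and it assumes that "potential redundancy" means the share of unique books held by more than one library.

```python
from collections import Counter

def overlap_profile(holdings, system_wide):
    """Summarize coverage and overlap for a group of library collections.

    holdings: dict mapping library name -> set of book identifiers (toy data).
    system_wide: set of all book identifiers in the system-wide collection.
    """
    # Count how many libraries in the group hold each book.
    holders = Counter(book for books in holdings.values() for book in books)
    unique_books = len(holders)
    # Coverage: share of the system-wide collection held by at least one library.
    coverage = unique_books / len(system_wide)
    # Overlap: share of the group's unique books held by exactly k libraries.
    by_count = Counter(holders.values())
    overlap = {k: n / unique_books for k, n in sorted(by_count.items())}
    # Assumed definition: books held by more than one library are redundant.
    redundancy = 1 - overlap.get(1, 0.0)
    return coverage, overlap, redundancy

# Hypothetical toy data: three libraries, ten books system-wide.
toy_holdings = {
    "Library A": {1, 2, 3, 4},
    "Library B": {3, 4, 5},
    "Library C": {4, 6},
}
coverage, overlap, redundancy = overlap_profile(toy_holdings, set(range(1, 11)))
print(coverage, overlap, redundancy)
```

Under that assumed definition, the slide's figure of 61 percent held by a single library leaves 39 percent held by two or more, which is consistent with the stated redundancy rate of about 40 percent.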
Language distribution
Language      Google 5   System-wide
English       0.49       0.52
German        0.10       0.08
French        0.08       0.08
Spanish       0.05       0.06
Chinese       0.04       0.04
Russian       0.04       0.03
Italian       0.03       0.03
Japanese      0.02       0.04
Hebrew        0.02       0.01
Arabic        0.01       0.01
Portuguese    0.01       0.01
Polish        0.01       0.01
Dutch         0.01       0.01
Latin         0.01       0.01
Korean        0.01       0.01
Swedish       0.01       < 0.01
All others    0.07       0.08
More than 430 languages in Google 5 collection
Cumulative age distribution of G5 holdings
[Line chart: x-axis, Years; y-axis, Proportion Published During or Prior To Current Year (0 to 1)]
> 80 percent of Google 5 collection still in copyright
Works
[Bar chart: manifestations vs. works]
Google 5: 10.5 million manifestations, 9.1 million works
System-wide: 32 million manifestations, 26.1 million works
Coverage slightly higher (35%)
Holdings overlap slightly greater (56% held uniquely)
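The difference between counting manifestations and counting works can be illustrated with a small sketch. It assumes each record carries an identifier that groups editions of the same work; the records and identifiers below are hypothetical, and the clustering method used in the study is not described in these slides.

```python
def coverage(held, system_wide):
    """Share of the system-wide set that also appears in the held set."""
    return len(held & system_wide) / len(system_wide)

# Hypothetical records: (manifestation_id, work_id). Several editions
# (manifestations) can belong to the same work.
system_records = [
    ("m1", "w1"), ("m2", "w1"), ("m3", "w1"),
    ("m4", "w2"), ("m5", "w3"), ("m6", "w3"),
]
held_records = [("m1", "w1"), ("m5", "w3")]

system_manifestations = {m for m, _ in system_records}
system_works = {w for _, w in system_records}
held_manifestations = {m for m, _ in held_records}
held_works = {w for _, w in held_records}

# Holding any one edition counts as holding the work, so work-level
# coverage is at least as high as manifestation-level coverage.
manifestation_coverage = coverage(held_manifestations, system_manifestations)
work_coverage = coverage(held_works, system_works)
print(manifestation_coverage, work_coverage)
```

With the slide's own figures, 10.5 million of 32 million manifestations is about 33 percent, while 9.1 million of 26.1 million works is about 35 percent.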
Some speculation …
What results would have been obtained if a different group of libraries had been selected?
What incremental extensions to coverage can be obtained by adding additional library collections to original Google 5?
Chose 5 new libraries:
• Small US liberal arts college
• Large US public university
• Large US private university
• Large US metropolitan library
• Large Canadian university
Beyond the Google 5 …
                      “New” Google 5   “Original” Google 5
Total holdings:       ~8 million       ~18 million
Total unique books:   5.9 million      10.5 million
% of system-wide:     18 percent       33 percent
Redundant holdings:   26 percent       42 percent
Impact by library type: % of holdings unique relative to original G5 collection:
Large US metropolitan library: 39 percent (most unlike G5)
Large US private university: 25 percent
Large Canadian university: 23 percent
Large US public university: 21 percent
Small US liberal arts college: 13 percent (most like G5)
“The Google 10”
Original Google 5 (10.5 million books)
Google 10 collection: 12.3 million books, + 1.8 million (17%)
Diminishing returns?
Original G5: ~18 million holdings, 58% unique
New G5: ~8 million holdings, 22% unique
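The diminishing-returns comparison can be checked with simple arithmetic on the figures from the slides. The sketch below assumes that "% unique" means new unique books contributed per holding in each group; that reading is an interpretation, and the variable names are illustrative.

```python
# Figures from the slides, in millions (approximate).
original_g5_holdings = 18.0
original_g5_unique_books = 10.5
new_g5_holdings = 8.0
google_10_unique_books = 12.3

# Books the new five libraries add beyond the original Google 5.
added_books = google_10_unique_books - original_g5_unique_books

# Assumed reading of "% unique": new unique books per holding in each group.
original_yield = original_g5_unique_books / original_g5_holdings
new_yield = added_books / new_g5_holdings

# Growth of the combined collection relative to the original Google 5.
growth = added_books / original_g5_unique_books

print(f"added: {added_books:.1f} million")
print(f"original yield: {original_yield:.1%}, new yield: {new_yield:.1%}")
print(f"growth: {growth:.1%}")
```

Under this reading, the original five libraries yield roughly 58 percent unique books per holding, the second five yield roughly 22 percent, and the combined collection grows by about 17 percent, in line with the slide figures.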
The challenge of digital preservation
Capture/Selection
Description
Secure Storage
Media Management
Render
“The Preservation Pyramid” (adapted from Priscilla Caplan, FCLA)
Authenticity/Understandability
ECONOMICS RIGHTS
But …
Chris Rusbridge’s “digital preservation fallacies”:
• Digital preservation is very expensive
• File formats become obsolete quickly
• Interventions must occur frequently
• Digital preservation repositories should have very long time-scale aspirations
• The preserved object must be easily and instantly accessible in contemporary formats
• The preserved object must be faithful in all respects to original
Source: Rusbridge, C. “Excuse me … Some Digital Preservation Fallacies?” Ariadne, February 2006; http://www.ariadne.ac.uk/issue46/rusbridge/
Bottom Line: significant progress has been made, but:
• Still lack well-understood, standardized practices for preserving digital materials
• No consensus on what “successful digital preservation” means
Mass digitization and digital preservation
Roles and responsibilities: Google? Libraries? Elsevier? JSTOR?
Digitized books as artifacts to be preserved, or disposable surrogates?
Implications for redundancy in system?
What uses can digitized output be put to?
• Discovery/linking (e.g., mbooks)
• Text-mining
Infrastructure to support large-scale digital content management
Efficient, automated workflows for preservation metadata
“Last copy”
Summing up …
Distance between collections shrinking; mass digitization programs and other aggregate collections increasingly common features of library landscape
To mobilize aggregate collections, need to understand anatomy of aggregate collections – i.e., data and analysis to support planning and collaboration
• Characterize and promote the “collective collection”: the collective library resource
• Chart a course through mass digitization (e.g., G5 study)
Mass digitization raises important questions about long-term preservation (summarized by “preservation pyramid”); need strategies to secure long-term future of digitization investments