Anatomy of Aggregate Collections: Exploring Mass Digitization and the “Collective Collection”

Brian Lavoie, Research Scientist, OCLC Research

NELINET, September 21, 2006

Road map

Aggregate collections

Aggregate collections as a tool for understanding mass digitization projects
• “Anatomy of aggregate collections: the example of Google Print for Libraries” (D-Lib Magazine, September 2005)

Digital preservation and mass digitization

Conclusion

The shrinking “width of the border”

[Figure: Collection A and Collection B separated by a border. Distance metrics: economic, technical, physical.]

Aggregate collections

Definition: combined holdings of multiple institutions, viewed as a single collection
• 2 institutions, consortium, all libraries everywhere …
• WorldCat: aggregate collection of more than 70 million items, held by more than 25,000 institutions worldwide

Libraries embedded more deeply in networks of collaboration and coordination
• Decisions increasingly taken in context of inter-institutional environments, rather than local collection in isolation
• Shift in focus to resources of the “system”, rather than individual collections

As library networks develop and expand, opportunities arise to create value through collective action, or by aligning local collections with aspects of the system-wide environment
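The definition above can be illustrated with a minimal sketch that treats each library's holdings as a set of record identifiers and the aggregate collection as their union. The library names and identifiers are hypothetical and are not WorldCat data.

```python
# Hypothetical holdings: each library maps to the set of record identifiers it holds.
# Names and identifiers are illustrative only.
holdings = {
    "Library A": {"rec1", "rec2", "rec3"},
    "Library B": {"rec2", "rec3", "rec4"},
    "Library C": {"rec3", "rec5"},
}


def aggregate_collection(holdings):
    """Return the union of all libraries' holdings, viewed as one collection."""
    combined = set()
    for records in holdings.values():
        combined |= records
    return combined


aggregate = aggregate_collection(holdings)
total_holdings = sum(len(records) for records in holdings.values())

print(len(aggregate))    # distinct records in the aggregate collection
print(total_holdings)    # individual holdings summed across libraries
```

The gap between the two counts is what the later slides describe as overlap: the same record held by more than one institution.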

Anatomy of aggregate collections

Analysis of aggregate collections supports …
• Collaborative decision-making: direct collaboration by libraries (for example, collaborative storage strategies)
• “Decision-making in context”: local decision-making made in a larger context (for example, selecting print materials for digitization, given what has already been digitized elsewhere)

Better understanding of the anatomy of aggregate collections critical for wide range of library decision-making contexts:
• Collection management (cooperative collection development, shared off-site storage, collaborative preservation)
• Deeper resource sharing (meta-search, reducing frictions in resource sharing networks)
• Mass digitization

OCLC Research activities aimed at mobilizing library data (WorldCat) to understand and manage aggregate collections

Mass digitization and aggregate collections

Google Book Search (aka Google Print for Libraries)

Aggregate collection of digitized print books (combined holdings of Harvard, Michigan, Oxford, NYPL, and Stanford)

Focus on copyright issues; very little discussion of Google Book Search as aggregate collection

http://www.dlib.org/dlib/september05/lavoie/09lavoie.html

The system-wide print book collection as represented in WorldCat (January 2005)

[Bar chart: number of WorldCat records after each successive filter]
• Total WorldCat records: ~55 million
• Language-based monographs: ~41 million
• Language-based monographs, excluding government documents and theses/dissertations: ~35 million
• Language-based monographs, excluding government documents and theses/dissertations, in print format only: ~32 million print books

More information: Schonfeld & Lavoie, “Books without Boundaries: A Brief Tour of the System-wide Print Book Collection,” Journal of Electronic Publishing, Vol. 9, No. 2, Summer 2006. http://www.hti.umich.edu/cgi/t/text/text-idx?c=jep;cc=jep;view=text;rgn=main;idno=3336451.0009.208

G5 coverage of system-wide print book collection

[Pie chart]
• Held by at least one G5 library: 33% (10.5 million unique books)
• Not held: 67%
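As a quick check, the 33% coverage figure follows from the deck's own numbers: 10.5 million unique G5 books against roughly 32 million print books system-wide. The function name `coverage` below is an illustrative choice, not something from the original study.

```python
def coverage(group_unique, system_total):
    """Share of the system-wide collection held by at least one library in the group."""
    return group_unique / system_total


g5_unique_books = 10.5e6     # unique books held by the Google 5 (from the slides)
system_print_books = 32e6    # approximate system-wide print book count (from the slides)

share = coverage(g5_unique_books, system_print_books)
print(f"{share:.0%}")
```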

Holdings overlap

[Pie chart: share of G5 books by number of G5 libraries holding them]
• Held by 1: 61%
• Held by 2: 20%
• Held by 3: 10%
• Held by 4: 6%
• Held by 5: 3%

Potential redundancy rate of 40 percent
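An overlap distribution of this kind can be computed by counting how many libraries hold each record. The sketch below uses hypothetical holdings. It also assumes that the redundancy rate is the share of records held by more than one library, which roughly matches the slide (1 − 0.61 = 0.39), but the deck does not define the term.

```python
from collections import Counter

# Hypothetical holdings; library names and record identifiers are illustrative.
holdings = {
    "Lib1": {"a", "b", "c", "d"},
    "Lib2": {"b", "c", "e"},
    "Lib3": {"c", "f"},
}


def overlap_distribution(holdings):
    """Map k -> share of distinct records held by exactly k libraries."""
    holders_per_record = Counter()
    for records in holdings.values():
        holders_per_record.update(records)   # adds 1 for each record this library holds
    counts = Counter(holders_per_record.values())
    total = len(holders_per_record)
    return {k: counts[k] / total for k in sorted(counts)}


dist = overlap_distribution(holdings)

# Assumed definition: records held by more than one library are potentially redundant.
redundancy = 1 - dist.get(1, 0.0)

print(dist)
print(redundancy)
```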

Language distribution

Language      Google 5    System-wide
English       0.49        0.52
German        0.10        0.08
French        0.08        0.08
Spanish       0.05        0.06
Chinese       0.04        0.04
Russian       0.04        0.03
Italian       0.03        0.03
Japanese      0.02        0.04
Hebrew        0.02        0.01
Arabic        0.01        0.01
Portuguese    0.01        0.01
Polish        0.01        0.01
Dutch         0.01        0.01
Latin         0.01        0.01
Korean        0.01        0.01
Swedish       0.01        < 0.01
All others    0.07        0.08

More than 430 languages in Google 5 collection

Cumulative age distribution of G5 holdings

[Line chart: x-axis, years; y-axis, proportion published during or prior to current year (0 to 1)]

More than 80 percent of Google 5 collection still in copyright

Works

[Bar chart: manifestations vs. works, Google 5 and system-wide]
• Manifestations: Google 5, 10.5 million; system-wide, 32 million
• Works: Google 5, 9.1 million; system-wide, 26.1 million

At the work level:
• Coverage slightly higher (35%)
• Holdings overlap slightly greater (56% held uniquely)
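The difference between manifestation-level and work-level coverage can be sketched by grouping manifestations under a shared work identifier. The identifiers and the manifestation-to-work mapping below are hypothetical; the slides do not describe how works were actually derived. The last lines check the slide's 35% figure from its own counts.

```python
# Hypothetical (manifestation_id, work_id) pairs; several manifestations can share a work.
manifestations = [
    ("m1", "w1"), ("m2", "w1"),
    ("m3", "w2"),
    ("m4", "w3"), ("m5", "w3"),
    ("m6", "w4"),
]
held_by_group = {"m1", "m3", "m5"}   # manifestations held by the group (illustrative)

all_works = {work for _, work in manifestations}
held_works = {work for manif, work in manifestations if manif in held_by_group}

manifestation_coverage = len(held_by_group) / len(manifestations)
work_coverage = len(held_works) / len(all_works)

print(manifestation_coverage)   # coverage counted per manifestation
print(work_coverage)            # coverage counted per work

# Check against the slide: 9.1 million G5 works out of 26.1 million system-wide.
slide_work_coverage = 9.1e6 / 26.1e6
print(f"{slide_work_coverage:.0%}")
```

Holding any one manifestation counts as holding the work, which is why work-level coverage can exceed manifestation-level coverage.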

Some speculation …

What results would have been obtained if a different group of libraries had been selected?

What incremental extensions to coverage can be obtained by adding additional library collections to original Google 5?

Chose 5 new libraries:
• Small US liberal arts college
• Large US public university
• Large US private university
• Large US metropolitan library
• Large Canadian university

Beyond the Google 5 …

                      “New” Google 5    “Original” Google 5
Total holdings:       ~8 million        ~18 million
Total unique books:   5.9 million       10.5 million
% of system-wide:     18 percent        33 percent
Redundant holdings:   26 percent        42 percent

Impact by library type: % of holdings unique relative to original G5 collection:
• Large US metropolitan library: 39 percent (most unlike G5)
• Large US private university: 25 percent
• Large US private university: 25 percent
• Large Canadian university: 23 percent
• Large US public university: 21 percent
• Small US liberal arts college: 13 percent (most like G5)
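The redundant-holdings row in the table above appears consistent with one minus the ratio of unique books to total holdings. That reading is an assumption, since the slide does not give the formula. A short check using the table's rounded figures:

```python
def redundant_share(total_holdings, unique_books):
    """Assumed definition: share of holdings that duplicate a book already counted."""
    return 1 - unique_books / total_holdings


original_g5 = redundant_share(total_holdings=18e6, unique_books=10.5e6)
new_g5 = redundant_share(total_holdings=8e6, unique_books=5.9e6)

print(f"Original G5: {original_g5:.0%}")
print(f"New G5: {new_g5:.0%}")
```

Because the inputs are rounded (“~8 million”, “~18 million”), agreement with the slide's percentages is approximate.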

“The Google 10”

[Figure: original Google 5 (10.5 million books) extended by the new Google 5]

Google 10 collection: 12.3 million books, + 1.8 million (17%)

Diminishing returns?
• Original G5: ~18 million holdings, 58% unique
• New G5: ~8 million holdings, 22% unique
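The incremental extension from adding libraries can be sketched as a set difference: only records the existing aggregate lacks count as new. The toy sets below are hypothetical. The final lines rework the slide's own figures (12.3 million minus 10.5 million, relative to 10.5 million).

```python
def marginal_gain(existing, newcomer):
    """Records the newcomer contributes that the existing aggregate does not hold."""
    return newcomer - existing


# Hypothetical record identifiers, for illustration only.
original_group = {"a", "b", "c", "d", "e"}
new_library = {"d", "e", "f"}

added = marginal_gain(original_group, new_library)
growth = len(added) / len(original_group)

print(added)
print(growth)

# Slide figures: Google 10 holds 12.3 million books, original Google 5 holds 10.5 million.
slide_added = 12.3e6 - 10.5e6
slide_growth = slide_added / 10.5e6
print(f"{slide_growth:.0%}")
```

On this reading, the “22% unique” figure for the new G5 would correspond to roughly 1.8 million new books out of about 8 million holdings, though the slide does not spell out that calculation.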

The challenge of digital preservation

“The Preservation Pyramid” (adapted from Priscilla Caplan, FCLA)
• Capture/Selection
• Description
• Secure Storage
• Media Management
• Render
• Authenticity/Understandability

Also shown: Economics, Rights

But …

Chris Rusbridge’s “digital preservation fallacies”:
• Digital preservation is very expensive
• File formats become obsolete quickly
• Interventions must occur frequently
• Digital preservation repositories should have very long time-scale aspirations
• The preserved object must be easily and instantly accessible in contemporary formats
• The preserved object must be faithful in all respects to original

Source: Rusbridge, C. “Excuse me … Some Digital Preservation Fallacies?” Ariadne, February 2006; http://www.ariadne.ac.uk/issue46/rusbridge/

Bottom line: significant progress has been made, but:
• Still lack well-understood, standardized practices for preserving digital materials
• No consensus on what “successful digital preservation” means

Mass digitization and digital preservation

Roles and responsibilities: Google? Libraries? Elsevier? JSTOR?

Digitized books as artifacts to be preserved, or disposable surrogates? Implications for redundancy in system?

What uses can digitized output be put to?
• Discovery/linking (e.g., mbooks)
• Text-mining

Infrastructure to support large-scale digital content management

Efficient, automated workflows for preservation metadata

“Last copy”

Summing up …

Distance between collections shrinking; mass digitization programs and other aggregate collections increasingly common features of library landscape

To mobilize aggregate collections, need to understand anatomy of aggregate collections, i.e., data and analysis to support planning and collaboration
• Characterize and promote the “collective collection”: the collective library resource
• Chart a course through mass digitization (e.g., G5 study)

Mass digitization raises important questions about long-term preservation (summarized by “preservation pyramid”); need strategies to secure long-term future of digitization investments
