CS 502: Computing Methods for Digital Libraries. Lecture 28: Current work in preservation

Upload: paula-barrett

Post on 12-Jan-2016



CS 502: Computing Methods for Digital Libraries

Lecture 28

Current work in preservation


Administration

Review class

• Tuesday, 12:20. Room to be announced on web site "Notices".

• Format: questions (by you) and answers (by me).

Laptops

• Return before examination. Bring receipt to examination.

Examination

• Part 1: 5 questions, 1.5-hour time limit

• Part 2: nomad experiment questionnaire, no time limit


Education and research

Digital libraries in a state of flux:

• Much of this class has described material that is still experimental

• Cornell people and our colleagues are actively involved in many aspects

This class:

• Recent activities in preservation of materials on the web

• Some of my recent work


Some light reading

William Y. Arms, "Preservation of scientific serials: three current examples." Journal of Electronic Publishing, 5(2), December 1999. http://www.press.umich.edu/jep/05-02/arms.html

William Y. Arms, "Economic models for open-access publishing." iMP, March 2000. http://www.cisp.org/imp/march_2000/03_00arms.htm


Preservation of serials

September 1999 -- Workshop chaired by Deanna Marcum, Don Waters, Cliff Lynch

Issues in preserving online journals for 100 years

Invited paper by William Arms

"Preservation of Scientific Serials: Three Current Examples"

• ACM Digital Library

• Internet RFC Series

• D-Lib Magazine

Motivated by realization that early preservation work may be tackling the wrong problem


Publisher's role in preservation

Life cycle of electronic publication

1. Active management by publisher

2. Long-term preservation by another organization

Overall observation

• The length of #1 may be very short or hundreds of years

• The most vulnerable time is the transition between #1 and #2

Preservation discussions have emphasized #2 (e.g., the 5-level model)


ACM Digital Library

Organizational

• ACM is a stable organization that considers the Digital Library one of its principal assets

Rights

• ACM either owns copyright or has full preservation rights

Technical

• Complex: relational database (schema), SGML (DTD), rendering software, private metadata system

• Strong computing department

Replication

• No independent mirrors


Internet RFC Series

Organizational

• Complex relationship between the Internet Society (ISOC), the Internet Engineering Task Force (IETF) and the RFC Editor. Currently actively managed, but no long-term commitment

• Secretariat & RFC editor -- income from meetings & grants

Rights

• ISOC and IETF have very broad rights

Technical

• Simple: text only (a few PostScript)

Replication

• Several independent mirrors


D-Lib Magazine

Organizational

• Published by CNRI, reliant on grants.

Rights

• Authors own rights in articles. CNRI owns rights in other materials.

Technical

• Simple: uses basic web technology.

• Used for experiments in DOIs, XML metadata, etc.

Replication

• Several independent mirrors


Approaches to preservation of the web

Partnership with publishers

Publishers and libraries as partners

Selective collection of open access web

Librarianship in a new domain

Bulk collection of open access web

Automatic librarianship


Partnerships with publishers

Library of Congress and UMI

• US theses and dissertations

American Physical Society and Cornell University

• Journals in physics

Elsevier Science

• Policy statement on archiving


Approaches to preservation of the web

Partnership with publishers

Publishers and libraries as partners

Selective collection of open access web

Librarianship in a new domain

Bulk collection of open access web

Automatic librarianship

Cornell and Library of Congress


Selective preservation

Selection of web sites

Example: National Library of Australia

• national importance

• multiple versions (print and online)

• authority and research value


Selection of web sites

Pragmatic considerations

• technical complexity

-- not all standards are good

• frequency of making copies

• COST

Librarianship in a new domain


Catalogs and indexes

Example: CORC

• simple standard using Dublin Core

• tools for creating records

• COST
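To make the idea of a simple Dublin Core record concrete, here is a minimal sketch in Python. It is not CORC's actual record format or tooling, which this lecture does not describe. The wrapper element `record` and the function name `make_dc_record` are assumptions made for illustration; the namespace URI is the standard one for the Dublin Core element set, and the field values come from the reading list earlier in this lecture.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"


def make_dc_record(fields):
    """Build a <record> element with one dc:* child per (name, value) pair."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # wrapper element name is an assumption
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Field values taken from the first item under "Some light reading".
example = make_dc_record([
    ("title", "Preservation of scientific serials: three current examples"),
    ("creator", "William Y. Arms"),
    ("date", "1999-12"),
    ("type", "Text"),
    ("identifier", "http://www.press.umich.edu/jep/05-02/arms.html"),
])

xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

Even a record this small takes a person time to create and check, which is why COST appears on the slide.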

Librarianship in a new domain


Bulk collection: automatic librarianship

Volumes of information are too great for human selection, indexing and management

Examples:

• Kulturarw3 -- National Library of Sweden

• Internet Archive -- Brewster Kahle

Automatic methods are used to collect, organize and provide access


Automatic librarianship

Collection

Example: Internet Archive

• Collecting open access web since 1996

• Complete sweep of web approximately once a month

• HTML pages only

• 14 terabytes of data (soon all online)

• access for researchers using Unix tools

• 7 people
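The core step of an automatic sweep that keeps HTML pages only can be sketched as follows. This is an illustration under stated assumptions, not the Internet Archive's crawler, whose design this lecture does not describe. The names `LinkExtractor`, `is_html` and `process_page` are hypothetical, and the sketch works on a page body already in memory rather than fetching anything from the network.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def is_html(content_type):
    """True if a Content-Type header value denotes an HTML page."""
    return content_type.split(";")[0].strip().lower() == "text/html"


def process_page(url, content_type, body, seen):
    """Return links not seen before; non-HTML pages yield no links."""
    if not is_html(content_type):
        return []
    parser = LinkExtractor(url)
    parser.feed(body)
    new_links = []
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            new_links.append(link)
    return new_links


seen = {"http://example.org/"}
page = '<a href="a.html">A</a> <a href="/b.html#top">B</a> <a href="a.html">A again</a>'
print(process_page("http://example.org/", "text/html; charset=utf-8", page, seen))
```

A real sweep would add fetching, politeness delays, robots.txt handling and storage. The point here is only that selection is made by a rule (content type) rather than by a person.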


Automatic librarianship

Indexing

Examples:

• ResearchIndex

• Google


Legal issues

Legal position of archives that download open access materials is unclear

• Preservation is in the national interest

• See the discussion in The Digital Dilemma (National Academy of Sciences, 1999)

• Crucial factor is economic impact on copyright owners

• Library of Congress has no special position except via copyright deposit

• U.S. Copyright Office has offered to help clarify the position


Current activities

Selection: guidelines and prototypes

• Library of Congress working group

• Political web sites

Tools

• Web site mirroring

• Web site profiler (M.Eng. project)

Copyright

• Ad hoc working group (Deanna Marcum, Bill Arms)


CS 502

Computing Methods for Digital Libraries

THE END