Archiving the Web: why bother ? LA Times (March 2000)



Archiving the Web: why bother ?

LA Times (March 2000)


Archiving the Web: why bother ?

• “Web sites are an increasingly important part of [an] institution’s digital assets and of [a] country’s information and cultural heritage.” (JISC – April 2002)

• “A lot of history is born digital. This should not be like early television where there is no record.” (Brewster Kahle – May 2002)


Archiving the Web: who bothers?

• Australia

• USA

• Nordic countries: Denmark, Finland, Sweden

• Other countries: UK, France, Japan

• Internet Archive

– “Wayback Machine”


Three conferences:

• What’s next for Digital Deposit Libraries ? Darmstadt, September 2001

• International Symposium on Web Archiving. Tokyo, January 2002

• DPC Forum: Web-archiving. London, March 2002.


Issues and Questions

• Legal Deposit of Digital Information ?

– European Union Copyright Directive

• Copyright ?

• Open or closed access ?

• Selective or comprehensive ?

• When in the life cycle ? How often ?

• Capturing the experience

– Dynamic web sites


Technical challenges

• Embedded external links and executable programs

• Persistent naming and date stamping

• Duplicate control

• Change in content over time

• Surface web vs Deep web


Australia (PANDORA Archive) – NLA http://www.nla.gov.au/pandora

• As yet no legal deposit.

• Mandate for collecting Commonwealth Government publications

• Selective

– (Australian e-journals, organisational sites, government publications, ephemera)

• Accessible by public

– Catalogued in the NBD


Australia (PANDORA Archive)

• ~1700 titles in the Archive (Nov. 2001)

– Growth rate: 40 sites/month

– Regathering: 35 sites/month

• ADRI (Australian Digital Resource Identifier)

– Unique identification scheme

– In-house resolving system


USA (Minerva) - Library of Congress

• (Mapping the Internet Electronic Resources Virtual Archive)

• Open access materials from the Web

• Changes in copyright law under discussion

• Selective inclusion

• Public access


LC/IA Pilot Project – “Election 2000”

• Joint pilot project of the Library of Congress and the Internet Archive

• Objectives:

– Library pilot: selection, collection and cataloguing of web sites; build prototype access system

– Internet Archive pilot: gain experience in harvesting and archiving sites

• Over 800 websites (150+ selected sites and major sites hyperlinked to/from those sites)

• 2-3 terabytes of data

• Archived daily August 2000 to January 2001


Denmark http://www.netarchive.dk

• Royal Library, Copenhagen.

• Limited legal deposit of electronic publications

– Static, not dynamic publications – finite units

• Access only from workstations at Royal Library and State and University Library

• Archiving static websites (monographs, periodicals)

• Server mirrored nightly to State and University Library, Århus


Denmark (Statistics)

• June 2001 - archived 9000 net publications

– 31% monographs, 69% periodicals

– 67.5% public sector/university, 32.5% private sector publications

• Staff resources 0.5 technical; 0.8 librarian


Sweden (Royal Library)

• Take snapshots of Swedish Web several times/year

– No selection - take everything

– All www pages in Sweden, all articles in e-journals, all Swedish newspapers

– Definition of Sweden: .se, plus .com, .org and .net sites with a Swedish address or telephone number

– Archive only - no public access as yet.


Sweden (Software)

• Uses Whois to identify Swedish sites in non-.se domains

• Harvesting with COMBINE Robot software (Univ. of Lund)

– Collects papers by automatically following hypertext links

– Also collects pictures and sound

– Fully automatic - no human intervention
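
The link-following harvest and the "Swedish site" scope test on this slide can be illustrated with a minimal sketch. This is not the COMBINE software: the class and function names, the example.se URLs, and the idea of passing in a set of hosts already confirmed as Swedish (for instance through a Whois lookup) are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href/src attributes of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # href covers hypertext links; src covers pictures and sound.
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


def in_scope(url, extra_hosts=frozenset()):
    """True for .se hosts, or hosts a separate check has marked as Swedish."""
    host = urlparse(url).hostname or ""
    return host.endswith(".se") or host in extra_hosts


page = ('<a href="/om.html">Om</a> <img src="logo.gif"> '
        '<a href="http://example.com/">Elsewhere</a>')
collector = LinkCollector("http://www.example.se/index.html")
collector.feed(page)

# Only in-scope links would be queued for the next harvesting step.
to_visit = [url for url in collector.links if in_scope(url)]
```

Here `to_visit` keeps the two example.se URLs and drops example.com, unless that host is supplied through `extra_hosts`.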


Swedish Archive (Kulturarw3) http://www.kb.se/kw3

• Everything associated with an object and metadata stored in one file as a multipart MIME object

• Name of the file: 33 character string with time stamp

• Sept 2001: 110 million files - 3000 Gbytes of data from 97,000 web servers

• Stored on disk and magnetic tapes using Hierarchical Storage Management (HSM)
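
Storing an object and its metadata together as one multipart MIME file can be sketched with Python's standard `email` package. The header names (`X-Archived-URL`, `X-Fetched-At`), the two-part layout and the sample values are hypothetical; the slide does not describe the archive's actual file layout or its 33-character naming scheme.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def package(url, fetched_at, http_headers, body):
    """Bundle harvest metadata and page content into one multipart MIME string."""
    container = MIMEMultipart("mixed")
    # Hypothetical header names; the real metadata fields are not given.
    container["X-Archived-URL"] = url
    container["X-Fetched-At"] = fetched_at
    container.attach(MIMEText(http_headers, "plain", "utf-8"))  # metadata part
    container.attach(MIMEText(body, "html", "utf-8"))           # content part
    return container.as_string()


body = "<html><body>Hej</body></html>"
archived = package(
    "http://www.example.se/",
    "2001-09-01T12:00:00Z",
    "HTTP/1.0 200 OK\nContent-Type: text/html",
    body,
)

# Reading the single file back recovers both parts.
parsed = message_from_string(archived)
metadata_part, content_part = parsed.get_payload()
restored = content_part.get_payload(decode=True).decode("utf-8")
```

Keeping everything in one file means a single read returns both the page and the context it was harvested in.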


Swedish Archive (Kulturarw3) (2)

• Prior to July 2002: Limited legal deposit (fixed form e-documents)

• December 2001: Data Inspection Board team confirms project is illegal. Project suspended

• July 2002: Amendments to Swedish copyright law give the Royal Library the right to collect the Swedish web and to make the archive publicly available.


Finland - National Library

• Follows Swedish approach - only .fi domain initially

• Finnish Copyright Act under revision to permit harvesting web resources

• Uses harvesting software developed in Finland from NEDLIB specification

• Archive Metadata

– Uses MD5 checksum for duplicate control, authentication and creating a unique access key

– Time stamped upon retrieval
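
The three uses of the MD5 checksum listed above (duplicate control, authentication, access key) can be shown in a small sketch. It is not the Finnish harvester's code: the `DedupStore` class, its method names and the example.fi URLs are assumptions for illustration. MD5 is used because the slide names it; it is no longer considered collision-resistant.

```python
import hashlib


class DedupStore:
    """Keep one copy of each distinct content, keyed by its MD5 digest."""

    def __init__(self):
        self.objects = {}    # MD5 hex digest -> content bytes
        self.sightings = []  # (url, retrieval timestamp, digest)

    def add(self, url, content, retrieved_at):
        digest = hashlib.md5(content).hexdigest()
        is_new = digest not in self.objects
        if is_new:
            self.objects[digest] = content  # duplicate control
        self.sightings.append((url, retrieved_at, digest))
        return digest, is_new               # digest doubles as access key

    def verify(self, digest):
        """Authentication: the stored bytes must still hash to their key."""
        return hashlib.md5(self.objects[digest]).hexdigest() == digest


store = DedupStore()
key_a, new_a = store.add("http://www.example.fi/a.html", b"<p>sama</p>",
                         "2001-08-01T10:00:00Z")
key_b, new_b = store.add("http://mirror.example.fi/a.html", b"<p>sama</p>",
                         "2001-08-02T10:00:00Z")
```

The second call returns the same key with `is_new` set to `False`, so the identical page is recorded as seen twice but stored once.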


Finland - Results of current Harvesting Round (1)

• Harvesting round 2001-2002

– Commenced August 2001 - completed in April 2002

– 9.4 million files from 29 million locations (URLs)

– Compressed data occupied 340 Gbytes of storage

– Stored on a tape robot in national supercomputing centre

– Hardware used: Sun E450 server
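
As a rough check on scale, the figures above imply an average compressed size per file. The short calculation below assumes decimal gigabytes (10^9 bytes), which the slide does not specify.

```python
files = 9_400_000                 # files harvested in the 2001-2002 round
compressed_bytes = 340 * 10**9    # 340 Gbytes, assuming 1 Gbyte = 10**9 bytes

avg_bytes_per_file = compressed_bytes / files
avg_kb_per_file = round(avg_bytes_per_file / 1000, 1)
print(f"Average compressed size: {avg_kb_per_file} kB per file")
```

Under that assumption the average works out to roughly 36 kB per file after compression.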


Finland - Results of current Harvesting Round (2)

• Finnish experience: “the NEDLIB harvester can deal with any national Web space (except perhaps the USA) with reasonably modest hardware, provided that there is sufficient storage space available somewhere”. (Juha Hakala, leader of the Finnish team)


Nordic Web Archive

• Joint project of Nordic national libraries

• Not dependent on what harvester is used

– NEDLIB (Finland, Norway, Denmark), COMBINE (Sweden)

• Selected Norwegian search engine (FAST)

• Software

– Converts documents from 100 different MIME types to HTML

– Recognises most European languages

• Budget: 260,000 Euros (AUS $475,000)


“The homogeneous (surface) Web”

59.3% - Text/HTML

37.9% - Image (GIF,JPEG,PNG)

1.7% - PDF

1.1% - Other formats

1.5 million HTML

1 million GIF

550,000 JPEG

36,500 PDF

11,800 plain text

6,000 Word

5,300 Java

etc.

Denmark / Finland


United Kingdom (1)

• British Library

– “Domain.uk” experiment (commenced 2002)

• Select and capture 100 UK websites (2001 election, GM crops)

• Email selected sites for approval

• Revisit every three weeks

• Uses Blue Squirrel WebWhacker software

• Audit change, loss and links over time

– Intention to scale up (2004 funding bid)


United Kingdom (2)

• UKOLN Research Project

– Estimates of size of .uk domain: 3 million sites, 24 million pages

• Wellcome Library/JISC Archiving Study to find a solution to web archiving

– The “medical web”

– Consultancy awarded March 2002 - completion date October 2002.

– Draft report August 2002. Final report to be disseminated to the community


Germany

• (Deutsche Bibliothek)

– Experiments with targeted harvesting

– Two incomplete snapshots 12/2000 and 02/2001


France

• (Bibliothèque nationale de France)

– In 2001: two experiments with small numbers of sites (16, 100), including music, video and multimedia.

– Unsatisfactory results:

• Unexpected features

• Exceptionally large sites

– Planning new feasibility study with 2 different robot providers

– Change in legal deposit law proposed in June 2001. Not yet adopted by Parliament.


Japan

• National Diet Library

• WARP (Web Archiving Program)

• Initially selective

• Major changes in Japanese copyright law expected to permit more comprehensive collecting.


Internet Archive (1)

• Founded by Brewster Kahle in 1996 - $15 million from sale of WAIS

• Non-profit organisation.

– Sponsors include AT&T Research, Compaq, Xerox PARC, Quantum DLT, National Science Foundation.

• Archived web pages from 1996+, movies from 1903 to 1973

• Site has archived over 10 billion pages (Oct. 2001) = more than 100 terabytes

• Growth rate: 10 terabytes/month


Internet Archive (2)

• Complete sweep of the Web every two months

• “Robot exclusions” - many newspapers, individuals, photographers

• Complete copy of Archive at Bibliotheca Alexandrina (April 2002)

• Duplicates in other continents proposed. “Best method of preservation is replication”.

• Copyright ? “May be a massive violation of copyright law”. (Lawrence Lessig, Stanford University expert on IP law in Cyberspace)
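
The “robot exclusions” mentioned above presumably refer to the Robots Exclusion Protocol, in which a site publishes a robots.txt file that well-behaved harvesters consult before fetching. A minimal sketch using Python's standard `urllib.robotparser` follows; the robots.txt content, the user-agent name and the example.org URLs are made up for illustration and do not describe the Internet Archive's own crawler.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as a photographer's site might publish it.
robots_txt = """\
User-agent: *
Disallow: /photos/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A harvester asks before each fetch and skips anything disallowed.
allowed = parser.can_fetch("example-archiver", "http://example.org/index.html")
blocked = parser.can_fetch("example-archiver", "http://example.org/photos/1.jpg")
```

With this file, the home page may be fetched while everything under /photos/ is skipped, which is how a site owner opts out of being archived.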


“Wayback machine” - http://www.archive.org

• Front end to the Internet Archive collection of public web pages

• Includes most image files in the collection

• Launched October 2001

• Fully available to public

• 20,000 users/day; up to 200 queries per second

• Not yet text searchable (URL search only)

• Financial sustainability ? (No advertising)


Conclusion

• “We’re not here to test laws. We’re trying to build a world we want to live in. The world without a library is a world without memory, and that would be tragic.” B. Kahle, October 2001.

• “On the Web, anyone can be a publisher; now there is a library for their work.” B. Kahle, May 2002