web archiving projects end-user perspective


Upload: bogdan-trifunovic

Post on 11-May-2015

Category:

Education


DESCRIPTION

A presentation reviewing web archiving projects from the end-user perspective, and covering web archiving in Serbia, presented at the VIII National Conference of the National Center for Digitization, Belgrade, Serbia, April 16, 2009.

TRANSCRIPT

Page 1: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

Bogdan Trifunovic, M. A.
Digitization Center
Public Library Cacak
[email protected]
www.cacak-dis.rs

Page 2: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

The purpose of research

Examining the usability and accessibility of publicly accessible web archiving projects

Identifying user-friendly features of the web sites of several web archiving projects, and creating a basic structure and framework for comparative analysis

Raising awareness about web archiving

Page 3: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

INTERNET ARCHIVE

http://www.archive.org

Established in 1996 as a non-profit organization (private funding)

Oldest web archiving project, using the Alexa crawler (robot) to create snapshots of the entire WWW

The sheer size of the Internet doesn’t allow capturing everything online

Page 4: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

Newer approaches

Mostly dealing with the “national” part of the WWW (e.g. capturing and archiving the national domain, digital preservation of “web heritage”)

Run by major national institutions (libraries, consortia)

Selective approach: identifying quality Internet content that satisfies established standards

Page 5: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

Web Archiving projects

PANDORA (National Library of Australia)

EUROPEAN ARCHIVE (non-profit)

MINERVA (Library of Congress)

UK WEB ARCHIVE (British Library)

WEBARCHIV (National Library of the Czech Republic)

*All projects were reviewed in November 2008

Page 6: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

PANDORA

http://pandora.nla.gov.au/

PANDAS (PANDORA Digital Archiving System)

HTTrack crawler

Excellent documentation, collections easy to navigate and browse

Basic and advanced search options

Unlimited access to collections

Page 7: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

EUROPEAN ARCHIVE

http://www.europarchive.org/

New project, still in development

Web 2.0 elements (tag cloud, my Desktop)

Internet Archive harvesting services

No search options for web archive, multilingual interface

Unlimited access

Page 8: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

MINERVA

http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

Harvested by the Internet Archive

Thematic collections (US elections, war in Iraq, etc.)

Restrictions on access to some collections (accessible only from LOC)

Page 9: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

UK WEB ARCHIVE

http://www.webarchive.org.uk

Established in 2003 by six institutions as the UK Web Archive Consortium; between 2005 and 2007 the project used PANDAS technology; in 2008 a new web archiving system based on Web Curator Tool was introduced

The BL has maintained the project since 2008

Page 10: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

WEBARCHIV

http://www.webarchiv.cz/

Heritrix crawler

Archives the Czech web domain; public access to a collection of websites (900+) with signed contracts, everything else accessible only from NKP

No search option except by URL, content not indexed

Page 11: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

Why archive the web

The general idea is that, given the changing nature of the WWW and the instability of information on the Internet, online content should be preserved in some way, because it is part of national (digital) culture

Preservation of online documents (e.g., for citation accuracy)

The huge growth of online material

Page 12: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

Difficulties

There are three important characteristics of the Web that make crawling it very difficult:
• its large volume,
• its fast rate of change, and
• dynamic page generation

Identifying web content that should be preserved for the future – the role of librarians, curators, archivists…
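The second and third difficulties can be illustrated with a minimal sketch, assuming a crawler that stores a fingerprint of each page body and compares it on the next visit. The function names here are hypothetical and are not taken from any of the projects reviewed.

```python
import hashlib
from typing import Optional


def content_fingerprint(body: bytes) -> str:
    """Return the SHA-256 hex digest of a page body."""
    return hashlib.sha256(body).hexdigest()


def has_changed(previous_fingerprint: Optional[str], body: bytes) -> bool:
    """Report whether a page differs from the last stored fingerprint.

    A page with no stored fingerprint (never harvested) counts as changed.
    """
    return previous_fingerprint != content_fingerprint(body)


# Illustrative page bodies (assumptions, not real harvested data).
first_visit = b"<html><body>Library news</body></html>"
second_visit = b"<html><body>Library news, updated</body></html>"

stored = content_fingerprint(first_visit)
print(has_changed(stored, first_visit))   # same bytes as before
print(has_changed(stored, second_visit))  # edited page
```

A byte-level comparison like this also shows why dynamic page generation is a problem: a page that embeds a timestamp or session value produces a new fingerprint on every visit, even when its real content is unchanged.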

Page 13: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

The case of Serbia

The process of changing the national domain from .yu to .rs started in 2008

By October 2009 all .yu content (everything with a .yu address) will permanently disappear from the WWW

Thousands of web pages will be lost

There is no strategy for preserving them (and also no time)
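As a first step toward such a strategy, the addresses at risk can be picked out of any existing list of links. The following is a minimal sketch, assuming the links are available as absolute URLs; the function names and sample addresses are hypothetical.

```python
from urllib.parse import urlsplit


def is_yu_host(url: str) -> bool:
    """Return True if the URL's host is under the .yu top-level domain."""
    host = urlsplit(url).hostname or ""
    return host == "yu" or host.endswith(".yu")


def at_risk(urls):
    """Keep only the URLs whose host will disappear with the .yu domain."""
    return [url for url in urls if is_yu_host(url)]


# Hypothetical sample addresses, for illustration only.
links = [
    "http://www.example.co.yu/news",
    "http://www.example.rs/",
    "http://example.yu.com/",  # ".yu" inside the name, but a .com host
]
print(at_risk(links))
```

Checking the parsed host name rather than searching the whole string avoids false matches on addresses that merely contain ".yu" somewhere in the host or path.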

Page 14: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

Planning on a small scale

The Digitization Center of Public Library Cacak created a short list of about 50 web sites of interest to us

We used the HTTrack (http://www.httrack.com/) web crawler to archive them locally

All websites for which the harvesting process was successful can be navigated locally
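Harvesting a short list like this can be scripted. The sketch below only builds one HTTrack command per site and does not run anything; it assumes HTTrack's command-line form `httrack <URL> -O <output path>`, and the site list, the `archive_root` folder name and the function name are hypothetical, not the settings used in Cacak.

```python
from pathlib import Path
from urllib.parse import urlsplit


def httrack_command(url: str, archive_root: str) -> list:
    """Build the argument list for mirroring one site into its own folder.

    Assumes the basic HTTrack invocation `httrack <URL> -O <output path>`.
    """
    host = urlsplit(url).hostname
    if not host:
        raise ValueError(f"not an absolute URL: {url!r}")
    output_dir = Path(archive_root) / host  # one folder per harvested host
    return ["httrack", url, "-O", str(output_dir)]


# Hypothetical short list; the real list had about 50 sites.
sites = ["http://www.example.rs/", "http://library.example.co.yu/"]

for site in sites:
    print(" ".join(httrack_command(site, "web-archive")))
```

Where HTTrack is installed, each argument list could be passed to `subprocess.run`, and the return code used to record whether the harvesting process was successful for that site.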

Page 15: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE
Page 16: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE
Page 17: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE
Page 18: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

Future steps

Improving the organizational framework for web archiving of local resources

Defining the legal setting – how to download and archive authorized material

Finding solutions for automatic archiving (partially solving the problem of staff shortages)
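One small piece of automatic archiving is deciding which sites are due for a new harvest without staff checking by hand. This is a minimal sketch, assuming a record of the last harvest date per site; the 30-day interval, the function name and the sample data are assumptions for illustration only.

```python
from datetime import date, timedelta


def sites_due(last_harvested: dict, today: date, interval_days: int = 30) -> list:
    """Return the sites never harvested or last harvested at least interval_days ago."""
    cutoff = today - timedelta(days=interval_days)
    return sorted(
        url
        for url, last in last_harvested.items()
        if last is None or last <= cutoff
    )


# Hypothetical harvest log: URL -> date of last successful harvest (None = never).
log = {
    "http://a.example/": date(2009, 1, 1),
    "http://b.example/": date(2009, 4, 1),
    "http://c.example/": None,
}
print(sites_due(log, today=date(2009, 4, 16)))
```

Run from a scheduled task, a check like this would leave staff to review only the failed or newly added sites.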

Page 19: WEB ARCHIVING PROJECTS END-USER PERSPECTIVE

THANK YOU!

QUESTIONS?

Bogdan Trifunovic, M. A.
Digitization Center
Public Library Cacak
[email protected]
www.cacak-dis.rs