web archiving projects end-user perspective
DESCRIPTION
A presentation reviewing web archiving projects from the end-user perspective, and also covering web archiving in Serbia, presented at the VIII National Conference of the National Center for Digitization, Belgrade, Serbia, April 16, 2009.
TRANSCRIPT
WEB ARCHIVING PROJECTS END-USER PERSPECTIVE
Bogdan Trifunovic, M. A.
Digitization Center
Public Library Cacak
[email protected]
www.cacak-dis.rs
The purpose of research
Examining the usability and accessibility of publicly accessible web archiving projects
Identifying user-friendly features of the web sites of several web archiving projects, and creating a basic structure and framework for comparative analysis
Raising awareness about web archiving
INTERNET ARCHIVE
http://www.archive.org
Established in 1996 as a non-profit organization (private funding)
Oldest web archiving project, using the Alexa crawler (robot) to create snapshots of the entire WWW
The sheer size of the Internet doesn’t allow capturing everything online
Newer approaches
Mostly dealing with the “national” part of the WWW (e.g. capturing and archiving the national domain, digital preservation of “web heritage”)
Run by major national institutions (libraries, consortia)
Selective approach of identifying quality Internet content that satisfies established standards
Web Archiving projects
PANDORA (National Library of Australia)
EUROPEAN ARCHIVE (non-profit)
MINERVA (Library of Congress)
UK WEB ARCHIVE (British Library)
WEBARCHIV (National Library of the Czech Republic)
*All projects were reviewed in November 2008
PANDORA
http://pandora.nla.gov.au/
PANDAS (PANDORA Digital Archiving System)
HTTrack crawler
Excellent documentation, collections easy to navigate and browse
Basic and advanced search options
Unlimited access to collections
EUROPEAN ARCHIVE
http://www.europarchive.org/
New project, still in development
Web 2.0 elements (tag cloud, my Desktop)
Internet Archive harvesting services
No search options for web archive, multilingual interface
Unlimited access
MINERVA
http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
Harvesting by Internet Archive
Thematic collections (US elections, war in Iraq, etc.)
Restrictions on access to some collections (only from LOC)
UK WEB ARCHIVE
http://www.webarchive.org.uk
Established in 2003 by six institutions as the UK Web Archive Consortium; between 2005 and 2007 the project used PANDAS technology; in 2008 a new web archiving system based on Web Curator Tool was introduced
BL has maintained the project since 2008
WEBARCHIV
http://www.webarchiv.cz/
Heritrix crawler
Archiving the Czech web domain; public access to a collection of websites (900+) with signed contracts, everything else only from NKP
No search option except by URL, content not indexed
Why archive the web
General idea: because the WWW keeps changing and information on the Internet is unstable, that information should be preserved in some way, as it is part of national (digital) culture
Preservation of online documents (e.g., for citation accuracy)
Because there is huge growth of online material
Difficulties
There are three important characteristics of the Web that make crawling it very difficult:
• its large volume,
• its fast rate of change, and
• dynamic page generation
Identifying web content that should be preserved for the future – the role of librarians, curators, archivists…
Serbia case
The process of changing the national domain from .yu to .rs started in 2008
By October 2009 all .yu content (everything with a .yu address) will permanently disappear from the WWW
Thousands of web pages will be lost
There is no strategy for preserving them (but also no time)
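As an illustration of what "everything with a .yu address" means in practice, the following is a minimal sketch (not something shown in the presentation) that picks the at-risk addresses out of a list of URLs. The function names and the sample URLs are hypothetical; the sketch assumes each URL includes its scheme, since otherwise the host cannot be parsed reliably.

```python
from urllib.parse import urlsplit


def is_yu_address(url: str) -> bool:
    """Return True if the URL's host is under the .yu top-level domain.

    Assumption: the URL carries a scheme (http://...); without one,
    urlsplit() reports no hostname and the URL is treated as not .yu.
    """
    host = urlsplit(url).hostname or ""
    # hostname is already lower-cased; strip a trailing dot of a fully
    # qualified name before checking the last label.
    return host.rstrip(".").endswith(".yu")


def at_risk(urls):
    """Keep only the URLs that would disappear with the .yu domain."""
    return [u for u in urls if is_yu_address(u)]


# Hypothetical sample list, for illustration only.
sample_urls = [
    "http://www.example.co.yu/index.html",
    "http://www.example.rs/",
    "http://EXAMPLE.ORG.YU:8080/page",
    "http://yu.example.com/",
]

for url in at_risk(sample_urls):
    print(url)
```

Checking the last label of the host, rather than searching for "yu" anywhere in the address, avoids false matches such as a host that merely starts with "yu".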
Planning on a small scale
Public Library Cacak Digitization Center created a short list of about 50 web sites of interest to us
We used the HTTrack (http://www.httrack.com/) web crawler to archive them locally
All websites where the harvesting process was successful can be navigated
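A short list of sites like the one described above could be harvested with a small wrapper around the HTTrack command line. The following is a sketch under stated assumptions, not the procedure actually used in Cacak: the function names, the archive directory layout, and the sample URL are hypothetical. It relies only on HTTrack's basic invocation, a start URL followed by `-O` and an output path, and it runs the crawler only when asked to and when `httrack` is found on the system.

```python
import shutil
import subprocess
from pathlib import Path
from urllib.parse import urlsplit


def httrack_command(url, archive_root):
    """Build the HTTrack command that mirrors one site into its own folder."""
    host = urlsplit(url).hostname or "unknown-host"
    target = Path(archive_root) / host  # hypothetical layout: one folder per host
    return ["httrack", url, "-O", str(target)]


def archive_sites(urls, archive_root, dry_run=True):
    """Return the commands for all sites; run them unless dry_run is set."""
    commands = [httrack_command(u, archive_root) for u in urls]
    if dry_run or shutil.which("httrack") is None:
        return commands  # nothing is executed
    for cmd in commands:
        # check=False: one failed harvest should not stop the remaining ones
        subprocess.run(cmd, check=False)
    return commands


# Hypothetical short list, for illustration only.
short_list = ["http://www.example.rs/"]
for cmd in archive_sites(short_list, "archive", dry_run=True):
    print(" ".join(cmd))
```

Keeping each site in its own folder makes it easy to see afterwards which harvests succeeded and to re-run only the ones that failed.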
Future steps
Improving the organizational framework for web archiving of local resources
Defining the legal setting – how to download and archive authorized material
Finding solutions for automatic archiving (partially solving the problem of staff shortages)
THANK YOU!
QUESTIONS?
Bogdan Trifunovic, M. A.
Digitization Center
Public Library Cacak
[email protected]
www.cacak-dis.rs