![Page 1: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/1.jpg)
Lessons Learned Archiving the National Web of New Zealand
Kris Carpenter Negulescu, The Internet Archive
Gordon Paynter, The National Library of New Zealand
Future Perfect 2012, 27 March 2012, Wellington, New Zealand
![Page 2: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/2.jpg)
Why collect the web?
![Page 3: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/3.jpg)
Legal deposit
• The National Library of New Zealand Act (2003)
• “Legal deposit” now includes “Internet documents”
• Available from http://legislation.govt.nz/
![Page 4: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/4.jpg)
Two web archiving programmes
Selective Harvesting of specific websites or parts of websites
Domain Harvesting of the entire “New Zealand Internet”
Image sources: http://topics.breitbart.com/fishing+pole/ and http://www.trimarinegroup.com/operations/fleet.php
![Page 5: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/5.jpg)
Selective Web Archiving
![Page 6: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/6.jpg)
Selective web archiving
![Page 7: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/7.jpg)
Selective web archiving
![Page 8: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/8.jpg)
Selective web archiving
![Page 9: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/9.jpg)
Selective web archiving
[Diagram: NDHA (Rosetta) and related systems. Labels shown:]
• National Library Beta, Voyager, Tapuhi, etc
• Timeframes, Papers Past
• Rosetta Access modules including ArcViewer
• Web Curator Tool
• Digitisation & Sound
• Other Published & Unpublished Material
• Preservation
• Administration
• Submission Tools
• Access Tools
• Technology Infrastructure
• Collection Management Systems
• IAMS
• NDHA (Rosetta)
![Page 10: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/10.jpg)
Selective web archiving
![Page 11: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/11.jpg)
Selective web archiving
![Page 12: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/12.jpg)
Selective web archiving
From January 2007: 14,182 harvests
• 83% Endorsed and Archived
• 17% Rejected or Aborted
• Using the Web Curator Tool
From 2000-2006: 441 harvests
• Some of multiple websites
• Using a desktop website capture tool
![Page 13: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/13.jpg)
New Zealand Web Harvests
October 2008 and April 2010
![Page 14: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/14.jpg)
New Zealand Web Harvests
• Scope
• Seeds
• Robots Policy
• Notification and communications
• How are we going to accomplish this?
• When are we going to stop?
![Page 15: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/15.jpg)
New Zealand Web Harvests
2008
• 17 days in October
• 106,184,620 URLs
• 4.6 Terabytes
• 397,000 hosts
• Seeds are known hosts
2010
• 24 days in April-May
• 131,770,485 URLs
• 6.9 Terabytes
• 559,000 hosts
• Seeds include .nz, .com, .org and .net zone files
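The per-host averages and the growth between the two harvests follow directly from the totals above. The sketch below is only arithmetic on those reported figures; it assumes decimal units (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes), which the slides do not state, and the names `HARVESTS`, `per_host` and `growth_percent` are illustrative.

```python
# Totals as reported on the slide for the two domain harvests.
HARVESTS = {
    2008: {"days": 17, "urls": 106_184_620, "terabytes": 4.6, "hosts": 397_000},
    2010: {"days": 24, "urls": 131_770_485, "terabytes": 6.9, "hosts": 559_000},
}

BYTES_PER_TB = 10**12  # assumption: decimal terabytes
BYTES_PER_MB = 10**6   # assumption: decimal megabytes


def per_host(harvest):
    """Return (average URLs per host, average megabytes per host)."""
    urls = harvest["urls"] / harvest["hosts"]
    megabytes = harvest["terabytes"] * BYTES_PER_TB / harvest["hosts"] / BYTES_PER_MB
    return urls, megabytes


def growth_percent(key):
    """Percentage change in one reported total from 2008 to 2010."""
    old, new = HARVESTS[2008][key], HARVESTS[2010][key]
    return (new - old) / old * 100


if __name__ == "__main__":
    for year, harvest in HARVESTS.items():
        urls, megabytes = per_host(harvest)
        print(f"{year}: {urls:.0f} URLs and {megabytes:.1f} MB per host")
    for key in ("urls", "terabytes", "hosts"):
        print(f"{key}: {growth_percent(key):+.1f}% from 2008 to 2010")
```

Under these assumptions the averages come out at roughly 267 URLs per host for 2008 and 236 for 2010, with about 12 megabytes per host in both years.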
![Page 16: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/16.jpg)
New Zealand Web Harvests
• Harvest analysis:
– What exactly do we have?
– What’s a good harvest frequency?
• Preservation analysis:
– ARC or WARC format?
– Should they be stored in the National Digital Heritage Archive?
• Public access analysis:
– Ethical issues
– Privacy issues
– Legal and evidentiary value
– Copyright
![Page 17: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/17.jpg)
Challenges and Lessons
![Page 18: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/18.jpg)
Scope of a National Domain
• How is a national web domain defined?
– Hosts in the top-level domain, or in domains operated by in-country registrars?
– Hosts known to be hosted on IP addresses within geographic boundaries?
– Content and advertising embedded in web sites published to the above?
– Curator-selected web sites, destinations, or services considered to be within the bounds of a country’s legislative or cultural heritage?
![Page 19: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/19.jpg)
Scope of a National Domain
• New Zealand Web Harvest scope:
– Hosts in the .nz top-level domain
– Hosts from .com, .org and .net that are physically in New Zealand
– A list of hosts known to be within the scope of the legislation
– Images, video clips, and other files that are embedded in web pages on the hosts above
• New Zealand Web Harvest seeds:
– 2008: Gathered from the Library and the Internet Archive’s past crawls
– 2010: Zone files for .nz, .com, .org and .net (plus 2008 hosts)
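The four scope rules can be read as a single decision per URL. The following is a minimal sketch of that reading, not the crawler configuration actually used: the function name `in_scope`, its parameters, and the idea of passing in pre-computed host sets (for example from a geolocation lookup) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Generic top-level domains named in the scope rules.
GENERIC_TLDS = (".com", ".org", ".net")


def in_scope(url, hosted_in_nz=frozenset(), known_in_scope=frozenset(),
             embedded_by_in_scope_page=False):
    """Apply the four scope rules to one URL.

    hosted_in_nz: hypothetical set of hosts found to be physically in New Zealand.
    known_in_scope: hypothetical curated list of hosts covered by the legislation.
    embedded_by_in_scope_page: True when an in-scope page embeds this URL.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host.endswith(".nz"):                                   # rule 1: .nz hosts
        return True
    if host.endswith(GENERIC_TLDS) and host in hosted_in_nz:   # rule 2: .com/.org/.net in NZ
        return True
    if host in known_in_scope:                                 # rule 3: curated list
        return True
    return embedded_by_in_scope_page                           # rule 4: embedded files


if __name__ == "__main__":
    print(in_scope("http://www.example.co.nz/about"))
    print(in_scope("http://shop.example.com/", hosted_in_nz={"shop.example.com"}))
    print(in_scope("http://cdn.example.net/logo.png", embedded_by_in_scope_page=True))
```

Keeping the geographic and curated lists as inputs, rather than looking them up inside the function, keeps the rule itself easy to test.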
![Page 20: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/20.jpg)
Shape of harvest
• How broad or deep should the harvest be?
– Usually as broad as possible (survey of all resources at the highest levels)
– Usually deep enough to collect the primary resources of interest while minimizing the unwanted, unrelated junk prevalent in any top-level domain
![Page 21: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/21.jpg)
Shape of harvest
• New Zealand Web Harvest
– Up to 10,000 URLs from every host
– But up to 50,000 for .govt.nz and .ac.nz.
• On average, about 250 URLs (12 megabytes) per host
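The per-host limits amount to a simple budget check before each URL is queued. This is a sketch under stated assumptions: the names `url_cap` and `may_queue` are hypothetical, and suffix matching on the host name is one plausible way to recognise .govt.nz and .ac.nz hosts, not necessarily how the harvest was configured.

```python
DEFAULT_CAP = 10_000   # up to 10,000 URLs from every host
RAISED_CAP = 50_000    # up to 50,000 for .govt.nz and .ac.nz
RAISED_SUFFIXES = (".govt.nz", ".ac.nz")


def url_cap(host):
    """Return the maximum number of URLs to collect from this host."""
    host = host.lower().rstrip(".")
    return RAISED_CAP if host.endswith(RAISED_SUFFIXES) else DEFAULT_CAP


def may_queue(host, fetched_so_far):
    """True while the host is still under its URL budget."""
    return fetched_so_far < url_cap(host)


if __name__ == "__main__":
    print(url_cap("www.natlib.govt.nz"))
    print(url_cap("www.example.co.nz"))
    print(may_queue("www.example.co.nz", 10_000))
```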
![Page 22: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/22.jpg)
Harvest Policies & Practices
• Robots Policy
– Respect robots.txt
– Ignore robots.txt for embedded and inline content on unrestricted pages
• Notification
– Notifications may be sent to site owners/publishers prior to harvest
• Politeness settings
– Usually limit load to that of a visitor navigating the site in a browser
• Trade-off of harvest duration vs scale of resources
– Need to keep the data capture period brief
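The robots policy above (respect robots.txt for pages, but not for embeds on pages that were themselves allowed) can be sketched with Python's standard `urllib.robotparser`. The function name `may_fetch`, its flags, and the user agent string are assumptions for illustration; a production crawler such as Heritrix has its own configuration for this.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_lines, user_agent, url, is_embed=False, parent_allowed=False):
    """Decide whether a URL may be fetched under the policy sketched above.

    robots_lines: the lines of the site's robots.txt, already downloaded.
    is_embed: True for images, stylesheets and other inline content.
    parent_allowed: True when the embedding page was not restricted.
    """
    # Embeds of an unrestricted page are fetched regardless of robots.txt.
    if is_embed and parent_allowed:
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = ["User-agent: *", "Disallow: /private/"]
    agent = "example-harvester"  # hypothetical user agent
    print(may_fetch(robots, agent, "http://example.org.nz/index.html"))
    print(may_fetch(robots, agent, "http://example.org.nz/private/report.html"))
    print(may_fetch(robots, agent, "http://example.org.nz/private/logo.png",
                    is_embed=True, parent_allowed=True))
```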
![Page 23: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/23.jpg)
Harvest Policies & Practices
• New Zealand Web Harvest Robots Policy
– Selective: Ignore robots.txt (usually)
– 2008: Ignore robots.txt (unless asked otherwise)
– 2010: Mostly honour robots.txt (following consultation)
• Four to six weeks of notification through many channels
![Page 24: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/24.jpg)
Harvest Infrastructure
• Dedicated crawlers to capture data
– Service nodes for reporting and access; shared infrastructure for automated QA, data mining and analysis
• Hardware:
– Quad Core Processors (2.6 GHz)
– 4-8 GB RAM/core
– 8+ Terabytes of local disk (Four 2-Terabyte SATA drives)
• Software:
– Ubuntu Linux
– Java(TM) SE Runtime Environment (latest build)
– Heritrix 3 or v1.14.x
• Network:
– Bandwidth is limited to ~300 Mbits/sec/project
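For a sense of scale, the average data rate of each New Zealand Web Harvest can be set against the ~300 Mbits/sec limit. The sketch below only converts the harvest totals reported earlier in this deck (4.6 Terabytes in 17 days, 6.9 Terabytes in 24 days) into an average rate; it assumes decimal terabytes and continuous crawling for the whole period, and it says nothing about peak rates. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BITS_PER_TB = 10**12 * 8  # assumption: decimal terabytes


def average_mbits_per_second(terabytes, days):
    """Average data rate in megabits per second over the whole harvest."""
    return terabytes * BITS_PER_TB / (days * SECONDS_PER_DAY) / 10**6


if __name__ == "__main__":
    print(f"2008: {average_mbits_per_second(4.6, 17):.1f} Mbit/s on average")
    print(f"2010: {average_mbits_per_second(6.9, 24):.1f} Mbit/s on average")
```

Both averages come out at roughly 25 to 27 Mbit/s, well under the stated per-project limit.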
![Page 25: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/25.jpg)
Harvest Infrastructure
In-house
• Possibly cheaper
• Large staff requirement
• Hardware requirements
• Network requirements
• Risks: what don’t we know?
Commissioned
• Higher outright cost
• Contractor provides expertise: Heritrix, crawler traps, scope, etc
• Contractor provides staff, computers, bandwidth
The New Zealand Web Harvests were commissioned from the Internet Archive.
Unexpected issue: International bandwidth
![Page 26: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/26.jpg)
Challenges of All Web Archiving
• Not all data can be crawled
• Can publishers “opt in” or “opt out”?
• Data may be lost no matter how carefully it is managed
• Harvested data hard to make accessible
– Intuitive interfaces for discovering and navigating resources
– With robust APIs
– All done in a compelling and sustainable way
• Research and experimentation are essential to keep pace with publisher innovation
![Page 27: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/27.jpg)
Challenges of Domain Archiving
• Harvests are at best samples
– Time & expense: can’t get everything
– Rate of change: don’t get every version
– Rate of collection: issues of ‘time skew’
• Choice of user agents/protocols
– If you crawl as the Mozilla agent your content may not redisplay in IE
– Which mobile agents should you crawl as, if any?
• Site structure & publishing models
– Some parts of sites are not “archive-friendly” (JavaScript, AJAX, Flash, etc.)
– Some sites change both their technical structure and policy quickly and often (YouTube, Facebook, etc.)
![Page 28: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/28.jpg)
Challenges of Domain Archiving
Social networks and collaborative/semi-private spaces
Immersive Worlds
70+% of the world’s digital content is now generated by individuals – not all of it can be crawled
(UK Telegraph, IDC annual survey, released May 2010)
![Page 29: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/29.jpg)
Challenges of Domain Archiving
• Manageable Costs/Sustainable Approaches
– Access to power & other critical operational resources
– Sufficient processing capacity for collection, analysis, discovery, & dissemination of resources
– Bandwidth
• Recruitment and retention of staff/engineering expertise; effective ongoing training
![Page 30: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/30.jpg)
Challenges of Domain Archiving
When do you stop crawling?
• The internet is infinitely large!
• Indicators that suggest diminishing returns have set in:
– A relatively small number of remaining hosts have a lot of depth
– More HTML than images appearing in the crawl log
– Higher incidence of crawler traps, content farms
• At this point we expect:
– We will capture proportionally more junk
– Website owners will complain that we're over-crawling
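One of the indicators above, more HTML than images appearing in the crawl log, is easy to check mechanically. This is a minimal sketch that assumes the MIME types of the most recent log entries have already been extracted into a list; parsing an actual crawl log is left out, and the function name is hypothetical.

```python
from collections import Counter

HTML_TYPES = ("text/html", "application/xhtml+xml")


def html_outnumbers_images(mime_types):
    """True when a window of recent captures holds more HTML than images."""
    counts = Counter()
    for mime in mime_types:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = mime.split(";")[0].strip().lower()
        if mime in HTML_TYPES:
            counts["html"] += 1
        elif mime.startswith("image/"):
            counts["image"] += 1
    return counts["html"] > counts["image"]


if __name__ == "__main__":
    recent = ["text/html; charset=utf-8", "image/png", "text/html", "text/css"]
    print(html_outnumbers_images(recent))
```

A single window is a weak signal on its own; it is probably best read alongside the other indicators listed above.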
![Page 31: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/31.jpg)
Challenges of Domain Archiving
How do you assess the quality of a harvest?
• Quantitative measures of quality, breadth and depth
• Qualitative measures including characterization of resources and how they fit with other collections
• Usually harvest for a period of weeks, depending upon the desired scope, and then run a “patch crawl”
![Page 32: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/32.jpg)
Challenges of Domain Archiving
• Being responsive during a crawl
• New Zealand Web Harvest 2008:
– 37 individual contacts during harvest
– 2 major mailing list discussions
– Blogs & Twitter
– Newspapers (“Library harvest costs website dear”) and radio
• A communications strategy and plan essential
– The biggest difficulty is responding promptly outside working hours
![Page 33: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/33.jpg)
Final thoughts?
What have we learned that is particularly relevant to New Zealand?
![Page 34: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/34.jpg)
Final thoughts
• New Zealand faces the same challenges as our peers overseas
• Most of the world favours dedicated web archives
– But we’re preserving web material alongside other formats.
• When will it be economical to harvest from New Zealand?
![Page 35: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand](https://reader034.vdocuments.net/reader034/viewer/2022050614/5557a373d8b42a696c8b4675/html5/thumbnails/35.jpg)
Final thoughts: how should national domain crawls work?
• Institutions crawl within their national domains from their own national infrastructure
• Institutions share tools, metadata, knowledge and best practices
– And to the extent possible – data!
– Collaboration will always achieve greater results than acting alone!
• Over the long term, shared goals and resources can help mitigate economic and other barriers to collection, mining, and access of New Zealand’s national digital heritage