Archiving the French Web: the BnF web archiving workflow. Sara Aubry
DESCRIPTION
Presented at the International Conference on Web Archives and Electronic Legal Deposit, at the Biblioteca Nacional de España (BNE), on 9 July 2013.
TRANSCRIPT
Archiving the French Web: the BnF web archiving workflow
Sara Aubry Web Archiving Project Manager, IT department Bibliothèque nationale de France
International Conference on Web archives and e-LD
Biblioteca Nacional de España, Madrid, July 9th 2013
Let’s start with some figures
• Programme start in 2000, industrialisation in 2008-2012
• Collections:
– 1996 - now
– 20 000 websites for focused crawls, 2.5 million .fr domains for broad crawls
– 18.8 billion URLs, 370 TB, growing by +100 TB / year
• Resources:
– 9 Full Time Employees (5 librarians, 4 engineers)
– many partners inside and outside the Library, at both the national and international level
– 70 robots (648 GB RAM, 144 CPUs at 2.4 GHz)
Digital curation is not different!
• « Actions, tools and practices defined and applied to collect, identify, select, organize and preserve digital contents (…) in order to use them and make them available (…) »
Definition of Digital Archiving in Wikipedia
BnF workflow overview
Selecting
Collecting
Indexing
Accessing
Preserving
nas_preload
Selecting with BCWeb
• A form-based application, commonly called a « curator tool »
– for content curators and researchers to nominate websites to harvest
– giving basic information about them (content policies, trends watch)
• Most important information for each website:
– Internet address/URL
– frequency (daily, monthly, yearly, once…)
– size/budget (small, medium, big)
– depth (entire domain, part of it)
Content curators
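The four fields listed above can be pictured as a small record. The following is a minimal sketch only, not BCWeb's actual data model: the class and field names (`Nomination`, `Frequency`, `Budget`, `Depth`) are hypothetical, and the frequency values are limited to the four the slide names (the slide's list ends with "…", so the real tool likely offers more).

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class Frequency(Enum):
    # Only the values named on the slide; the original list is open-ended.
    DAILY = "daily"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    ONCE = "once"


class Budget(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    BIG = "big"


class Depth(Enum):
    ENTIRE_DOMAIN = "entire domain"
    PART_OF_DOMAIN = "part of it"


@dataclass(frozen=True)
class Nomination:
    """Hypothetical record for one nominated website."""
    url: str
    frequency: Frequency
    budget: Budget
    depth: Depth

    def __post_init__(self):
        # Reject anything that is not an absolute http(s) address.
        parts = urlparse(self.url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute web address: {self.url!r}")
```

For example, `Nomination("http://www.bnf.fr", Frequency.MONTHLY, Budget.SMALL, Depth.ENTIRE_DOMAIN)` would describe a monthly, small-budget harvest of a whole domain.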
The Web is made of HTML pages
1 HTML page, 48 URLs
• 1 HTML
• 1 text/css
• 4 javascript
• 17 image/png
• 5 image/jpeg
• 21 image/gif
All links and inclusions are URL references
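To illustrate how links and inclusions in a page turn into a list of URLs, here is a minimal sketch using Python's standard `html.parser`. It is not how Heritrix extracts links; the set of tag/attribute pairs (`REFERENCE_ATTRS`) and the function name `extract_references` are assumptions chosen for the example, and a real crawler would also look at CSS, JavaScript and many more attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReferenceExtractor(HTMLParser):
    # (tag, attribute) pairs treated as references in this sketch.
    REFERENCE_ATTRS = {
        ("a", "href"),
        ("link", "href"),
        ("img", "src"),
        ("script", "src"),
    }

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []  # list of (tag, absolute URL)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.REFERENCE_ATTRS:
                # Resolve relative references against the page address.
                self.references.append((tag, urljoin(self.base_url, value)))


def extract_references(html, base_url):
    """Return every (tag, absolute URL) pair referenced by the page."""
    parser = ReferenceExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.references
```

Counting the results by tag gives a rough breakdown of a page's references, similar in spirit to the per-type breakdown above.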
Harvesting with Heritrix
• A harvester is a piece of software (crawler, spider, robot)
• Simulates what a person would do with a browser but repeatedly and very fast
• Follows a looping process
• Repeated as long as new, in-scope URLs are found and the limits (budget and time) have not been reached
WARC
Pick a location
Make a Request
Receive a Response
Examine for references
Save the content
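The looping process above can be sketched as a small breadth-first crawl. This is an illustration under stated assumptions, not Heritrix's implementation: `fetch` and `in_scope` are hypothetical callables supplied by the caller, the budget is simplified to a maximum number of saved URLs (the time limit is left out), references are found with a crude regular expression, and content is kept in a dictionary rather than written to WARC files.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Crude pattern for href="..." and src="..." attributes (illustration only).
HREF_OR_SRC = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def crawl(seeds, fetch, in_scope, budget):
    """Crawl from the seed URLs until no new in-scope URL is left
    or `budget` URLs have been saved.

    fetch(url)    -> response body as str, or None on failure (assumed)
    in_scope(url) -> True if the URL may be harvested (assumed)
    """
    frontier = deque(seeds)
    seen = set(seeds)
    saved = {}
    while frontier and len(saved) < budget:
        url = frontier.popleft()                 # pick a location
        body = fetch(url)                        # make a request, receive a response
        if body is None:
            continue
        for ref in HREF_OR_SRC.findall(body):    # examine for references
            target = urljoin(url, ref)
            if target not in seen and in_scope(target):
                seen.add(target)
                frontier.append(target)
        saved[url] = body                        # save the content
    return saved
```

Passing `fetch` in as a parameter keeps the sketch free of network code and makes the loop easy to try against an in-memory dictionary of pages.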
Assets:
- open source
- small and large scale
- textual or all-media formats
- data structures
Digital curators: legal deposit department
Engineers : IT department
Challenges:
• rich media and ever-changing environment
• social networks
• content behind paywalls (news sites, ebooks)
Piloting the crawls with NetarchiveSuite
• Prepare, schedule, run and monitor harvests of websites, perform QA
Digital curators: legal deposit department
Engineers : IT department
Offering access with Wayback
• Give readers the ability to browse the web “as it was” with:
– a regular web browser
– a search and redisplay software
• An application called “Web archives”
– Wayback: for URL search, display and browsing
– Nutch prototype for keyword search
– Guided paths for collection highlights
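Browsing the web "as it was" implies choosing, for a given URL, the capture nearest to the date the reader asks for. The sketch below shows one way to do that lookup; it is not the Wayback code or the BnF application. It assumes captures are identified by 14-digit timestamps (YYYYMMDDhhmmss), and the function name `closest_capture` is hypothetical.

```python
from bisect import bisect_left
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # assumed 14-digit capture timestamp


def closest_capture(captures, wanted):
    """Return the capture timestamp nearest to `wanted`, or None.

    captures: sorted list of 14-digit timestamp strings for one URL
    wanted:   14-digit timestamp string requested by the reader
    """
    if not captures:
        return None
    target = datetime.strptime(wanted, TIMESTAMP_FORMAT)
    # Fixed-width timestamps sort chronologically as plain strings.
    i = bisect_left(captures, wanted)
    # Only the neighbours just before and just after can be closest.
    candidates = captures[max(i - 1, 0): i + 1]
    return min(
        candidates,
        key=lambda ts: abs(datetime.strptime(ts, TIMESTAMP_FORMAT) - target),
    )
```

Because the list is sorted, the binary search keeps the lookup fast even when a URL has many captures.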
Challenges:
• links with our main Catalogue and open data repository
• “smart” URL search
• full text search and indexing
• small-scale data mining projects with researchers
Questions?
E-mail: [email protected]
Web site: http://www.bnf.fr
Twitter: http://twitter.com/DLWebBnF