TRANSCRIPT
Web Archiving @ The Internet Archive
Agenda
• Brief Introduction to IA
• Web Archiving Collection Policies and Strategies
• Key Challenges (opportunities for broader collaboration…)
What is the Internet Archive?
• A digital library established in 1996 that contains over four and a half petabytes (compressed) of publicly accessible digital archival material
• A 501(c)(3) nonprofit organization
• A technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
Currently archiving books, texts, film, video, audio, images, software, educational content, television, and the Internet…
www.archive.org
Data Storage & Preservation
• IA’s web archive spans 1996–present & includes over 150 billion web instances
• Develop freely available, open source web archiving & access tools (Heritrix, Wayback, NutchWAX…)
• Provide services that enable partners to drive their web archiving programs
• Perform crawls & host collections for libraries, archives, universities, museums, & other memory institutions
www.archive.org/web/ www.archiveit.org
Today’s Landscape
“The current size of the world’s digital content is equivalent to all the information that could be stored on 75bn Apple iPads, or the amount [of data] that would be generated by everyone in the world posting messages on Twitter constantly for a century…”
Source: UK Telegraph, reporting the IDC annual survey, released May 2010
http://www.telegraph.co.uk/technology/news/7675214/Zettabytes-overtake-petabytes-as-largest-unit-of-digital-measurement.html
Today’s Web Landscape
• Google: “seen well over 1 trillion unique URLs”
• Actual indexed pages:
– Tens of billions+ (~40–50 billion?)
– Cuil: “127 bil web pages” (July 15, 2010)
• Hundreds of millions of “sites”
– Site: a publishing network endpoint; one page to millions per site
– Diversity of content: streamed, social, interactive…
Collection Policies & Strategies
• Crawl Strategies
1) Broad, web-wide surveys from every domain, in every language, including media and text, static and interactive interfaces
2) Organic link discovery at all levels of a host/site
3) End-of-life, exhaustive harvests
4) Selective/thematic & resource-specific harvests
• Key Inputs: registry data, trusted directories, Wikipedia, subject matter experts, prior crawl data
• Frequency: usually ongoing, but at least yearly…
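Strategy 2 above, organic link discovery, can be sketched in a few lines. This is not IA's Heritrix code; it is a minimal, hypothetical illustration of one discovery step — parse a fetched page, resolve its links against the page's own URL, and return only the URLs not yet seen by the crawl frontier:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute, fragment-free URLs from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links and drop #fragments.
                    absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                    self.links.add(absolute)


def discover(page_url, html_text, seen):
    """One step of organic link discovery: parse a fetched page and
    return the newly discovered URLs to add to the crawl frontier."""
    parser = LinkExtractor(page_url)
    parser.feed(html_text)
    new = parser.links - seen
    seen |= new
    return new
```

A real crawler would repeat this step from a queue of seeds, applying scope rules (host, depth, media type) at each hop.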
Typical Challenges of Archiving the Web
• Harvests are at best samples
– Time & expense: can’t get everything
– Rate of change: don’t get every version
– Rate of collection: issues of ‘time skew’
• User agents/protocols
Typical Challenges, cont.
• Publisher right to opt “in” or “out”
– Content behind log-ins cannot be archived without credentials
– Content can be blocked by robots.txt files (which our crawlers respect by default)
• The structure of sites/URLs makes it very hard to capture only the content of interest; each site has its own unique set of challenges.
– Some parts of sites are not “archive-friendly” (e.g. complex JavaScript, Flash, etc.)
– These sites tend to change both their technical structure and policy quickly and often.
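The robots.txt opt-out mechanism mentioned above is simple to check programmatically. Below is a minimal sketch using Python's standard-library parser; the rules shown and the user-agent string are hypothetical examples, not IA's actual configuration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a publisher might serve at /robots.txt.
# An archiving crawler that respects robots.txt would skip
# anything these directives disallow.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def allowed_to_archive(url, robots_txt, user_agent="example-archiver"):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In production the rules would be fetched from each host (and re-fetched periodically, since publishers change policy often, as noted above).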
Challenges, cont.
Social networks and collaborative/semi-private spaces
Immersive Worlds
~70% of the world’s digital content is now generated by individuals
Source: UK Telegraph, reporting the IDC annual survey, released May 2010
Web QA & Analysis
Daunting scale requires a multi-layered approach:
– Automated QA to identify missing files used to render pages, and to prioritize URIs for harvest
– Filtering of spam and content farms discovered during harvest and post-harvest
– Randomized, representative human critique of “in” vs. “out” of scope per a given legal mandate
– Advanced analyses: web and link graphing, text mining
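The first QA layer above — finding missing files needed to render a page — can be illustrated with a small sketch. This is not IA's actual QA pipeline; it is a hypothetical example that extracts the embedded resources a page depends on and reports which ones are absent from the set of captured URLs:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceExtractor(HTMLParser):
    """Collect URLs of embedded resources needed to render a page."""

    # Tag -> attribute that carries the resource URL.
    RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attr = self.RESOURCE_ATTRS.get(tag)
        if attr:
            for name, value in attrs:
                if name == attr and value:
                    self.resources.add(urljoin(self.base_url, value))


def missing_resources(page_url, html_text, captured_urls):
    """Return embedded resources not yet in the archive; these would be
    prioritized for a follow-up (patch) harvest."""
    parser = ResourceExtractor(page_url)
    parser.feed(html_text)
    return parser.resources - captured_urls
```

Run over a sample of archived pages, the resulting gap list becomes the priority queue for the next harvest, which is how automated QA feeds back into crawl scheduling.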
Key Challenges
• Not all data can be crawled, need diverse methods of data collection
• Data may be lost no matter how carefully it is managed – Need to keep multiple, distributed copies!
• Harvested data can be hard to make accessible in a compelling way, on an ongoing basis, at *every* scale
• Research and experimentation are essential to keep pace with publisher innovation; partnerships are the only way to “keep up” and to support the demands of ongoing operations
Key Challenges, cont.
• Manageable costs/sustainable approaches
– Access to power & other critical operational resources
– Sufficient processing capacity for collection, analysis, discovery, & dissemination of resources
– Support for on-demand assembly of collections from aggregate data sets
– Timeliness of collection & access
• Intuitive interfaces for discovering & navigating resources over time, including robust APIs
• Recruitment of engineering talent
• Funding
Thank You!
Kris Carpenter Negulescu
Director, Web Group
Internet Archive
kcarpenter [at] archive [dot] org