TRANSCRIPT
Web Archiving @ The Internet Archive
Agenda
• Brief Introduction to IA
• Web Archiving Collection Policies and Strategies
• Key Challenges (opportunities for broader collaboration…)
What is the Internet Archive?
• A digital library established in 1996 that contains over four and a half petabytes (compressed) of publicly accessible digital archival material
• A 501(c)(3) nonprofit organization
• A technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
Currently archiving books, texts, film, video, audio, images, software, educational content, television, and the Internet…
www.archive.org
Data Storage & Preservation
• IA’s web archive spans 1996–present & includes over 150 billion web instances
• Develop freely available, open source web archiving & access tools (Heritrix, Wayback, NutchWAX…)
• Provide services that enable partners to drive their web archiving programs
• Perform crawls & host collections for libraries, archives, universities, museums, & other memory institutions
www.archive.org/web/ www.archiveit.org
Today’s Landscape
“The current size of the world’s digital content is equivalent to all the information that could be stored on 75bn Apple iPads, or the amount [of data] that would be generated by everyone in the world posting messages on Twitter constantly for a century…”
Source: UK Telegraph, reporting the IDC annual survey, released May 2010
http://www.telegraph.co.uk/technology/news/7675214/Zettabytes-overtake-petabytes-as-largest-unit-of-digital-measurement.html
Today’s Web Landscape
• Google: “seen well over 1 trillion unique URLs”
• Actual indexed pages:
– Tens of billions+ (~40–50 billion?)
– Cuil: “127 bil web pages” (July 15, 2010)
• Hundreds of millions of “sites”
– Site: a publishing network endpoint; one page to millions per site
– Diversity of content: streamed, social, interactive…
Collection Policies & Strategies
• Crawl Strategies
1) Broad, web-wide surveys from every domain, in every language, including media and text, static and interactive interfaces
2) Organic link discovery at all levels of a host/site
3) End-of-life, exhaustive harvests
4) Selective/thematic & resource-specific harvests
• Key Inputs: registry data, trusted directories, Wikipedia, subject matter experts, prior crawl data
• Frequency: usually ongoing, but at least yearly…
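Strategy 2 above, organic link discovery, can be sketched in a few lines. This is not IA's Heritrix code; it is a minimal, hypothetical illustration of one discovery step — parse a fetched page, resolve its links against the page's own URL, and return only the URLs not yet seen by the crawl frontier:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute, fragment-free URLs from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links and drop #fragments.
                    absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                    self.links.add(absolute)


def discover(page_url, html_text, seen):
    """One step of organic link discovery: parse a fetched page and
    return the newly discovered URLs to add to the crawl frontier."""
    parser = LinkExtractor(page_url)
    parser.feed(html_text)
    new = parser.links - seen
    seen |= new
    return new
```

A real crawler would repeat this step from a queue of seeds, applying scope rules (host, depth, media type) at each hop.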
Typical Challenges of Archiving the Web
• Harvests are at best samples
– Time & expense: can’t get everything
– Rate of change: don’t get every version
– Rate of collection: issues of ‘time skew’
• User agents/protocols
Typical Challenges, cont.
• Publisher right to opt “in” or “out”
– Content behind log-ins cannot be archived without credentials
– Content can be blocked by robots.txt files (which our crawlers respect by default)
• The structure of sites/URLs makes it very hard to capture only the content of interest; each site has its own unique set of challenges.
– Some parts of sites are not “archive-friendly” (e.g. complex JavaScript, Flash, etc.)
– These sites tend to change both their technical structure and policy quickly and often.
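The robots.txt opt-out mechanism mentioned above is simple to check programmatically. Below is a minimal sketch using Python's standard-library parser; the rules shown and the user-agent string are hypothetical examples, not IA's actual configuration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a publisher might serve at /robots.txt.
# An archiving crawler that respects robots.txt would skip
# anything these directives disallow.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def allowed_to_archive(url, robots_txt, user_agent="example-archiver"):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In production the rules would be fetched from each host (and re-fetched periodically, since publishers change policy often, as noted above).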
Challenges, cont.
Social networks and collaborative/semi-private spaces
Immersive Worlds
~70% of the world’s digital content is now generated by individuals
Source: UK Telegraph, reporting the IDC annual survey, released May 2010
Web QA & Analysis
Daunting scale requires a multi-layered approach:
– Automated QA to identify missing files used to render pages, and to prioritize URIs for harvest
– Filtering of spam and content farms discovered during harvest and post-harvest
– Randomized, representative human critique of “in” vs. “out” of scope per a given legal mandate
– Advanced analyses: web and link graphing, text mining
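The first QA layer above — finding missing files needed to render a page — can be illustrated with a small sketch. This is not IA's actual QA pipeline; it is a hypothetical example that extracts the embedded resources a page depends on and reports which ones are absent from the set of captured URLs:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceExtractor(HTMLParser):
    """Collect URLs of embedded resources needed to render a page."""

    # Tag -> attribute that carries the resource URL.
    RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attr = self.RESOURCE_ATTRS.get(tag)
        if attr:
            for name, value in attrs:
                if name == attr and value:
                    self.resources.add(urljoin(self.base_url, value))


def missing_resources(page_url, html_text, captured_urls):
    """Return embedded resources not yet in the archive; these would be
    prioritized for a follow-up (patch) harvest."""
    parser = ResourceExtractor(page_url)
    parser.feed(html_text)
    return parser.resources - captured_urls
```

Run over a sample of archived pages, the resulting gap list becomes the priority queue for the next harvest, which is how automated QA feeds back into crawl scheduling.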
Key Challenges
• Not all data can be crawled, need diverse methods of data collection
• Data may be lost no matter how carefully it is managed – Need to keep multiple, distributed copies!
• Harvested data can be hard to make accessible in a compelling way, on an ongoing basis, at *every* scale
• Research and experimentation are essential to keep pace with publisher innovation; partnerships are the only way to “keep up” and to support the demands of ongoing operations
Key Challenges, cont.
• Manageable costs/sustainable approaches
– Access to power & other critical operational resources
– Sufficient processing capacity for collection, analysis, discovery, & dissemination of resources
– Support for on-demand assembly of collections from aggregate data sets
– Timeliness of collection & access
• Intuitive interfaces for discovering & navigating resources over time, including robust APIs
• Recruitment of engineering talent
• Funding
Thank You!
Kris Carpenter Negulescu
Director, Web Group
Internet Archive
kcarpenter [at] archive [dot] org