
Posted on 26-Dec-2015


Web Archiving @ The Internet Archive

Agenda

• Brief Introduction to IA

• Web Archiving Collection Policies and Strategies

• Key Challenges (opportunities for broader collaboration…)

What is the Internet Archive?

• A digital library established in 1996 that contains over four and a half petabytes (compressed) of publicly accessible digital archival material

• A 501(c)(3) non-profit organization

• A technology partner to libraries, archives, museums, universities, research institutes, and memory institutions

Currently archiving books, texts, film, video, audio, images, software, educational content, television, and the Internet…

www.archive.org

Data Storage & Preservation

IA’s Web archive spans 1996-present & includes over 150 billion web instances

Develop freely available, open source, web archiving & access tools (Heritrix, Wayback, NutchWAX…)

Provide services that enable partners to drive their web archiving programs

Perform crawls & host collections for libraries, archives, universities, museums, & other memory institutions

www.archive.org/web/

www.archiveit.org
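Captures held in the Wayback Machine can be queried programmatically. As a minimal sketch, the public CDX search endpoint accepts a URL plus optional timestamp bounds; the helper below only builds the query URL (the function name and the example query are illustrative, and actually fetching results would require network access):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"  # public CDX server

def cdx_query_url(url, limit=5, from_ts=None, to_ts=None):
    """Build a Wayback CDX API query URL listing captures of `url`."""
    params = {"url": url, "output": "json", "limit": limit}
    if from_ts:
        params["from"] = from_ts  # timestamps use yyyyMMddhhmmss
    if to_ts:
        params["to"] = to_ts
    return CDX_ENDPOINT + "?" + urlencode(params)

# Example: up to five captures of archive.org recorded through 1997
query = cdx_query_url("archive.org", limit=5, to_ts="19971231")
```

Fetching `query` with any HTTP client would return a JSON list of capture records (timestamp, original URL, status, digest).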

Today’s Landscape

“The current size of the world’s digital content is equivalent to all the information that could be stored on 75bn Apple iPads, or the amount [of data] that would be generated by everyone in the world posting messages on Twitter constantly for a century…”

Source: UK Telegraph

http://www.telegraph.co.uk/technology/news/7675214/Zettabytes-overtake-petabytes-as-largest-unit-of-digital-measurement.html

IDC annual survey, released May 2010

Today’s Web Landscape

• Google: “seen well over 1 trillion unique URLs”

• Actual indexed pages:
– tens of billions+ (~40–50 billion?)
– Cuil: “127 bil web pages” (July 15, 2010)

• Hundreds of millions of “sites”
– Site: publishing network endpoint; one page to millions per site
– Diversity of content – streamed, social, interactive…

Collection Policies & Strategies

• Crawl Strategies

1) Broad, web-wide surveys from every domain, in every language, including media and text, static and interactive interfaces

2) Organic link discovery at all levels of a host/site

3) End-of-life, exhaustive harvests

4) Selective/thematic & resource-specific harvests

• Key Inputs: registry data, trusted directories, Wikipedia, subject matter experts, prior crawl data

• Frequency: usually ongoing, but at least yearly…
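Organic link discovery (strategy 2 above) can be sketched as a breadth-first traversal that stays within the seed's host. This is a simplified illustration, not the production crawler (Heritrix); the `fetch` callable is injected so the example runs offline, and the three-page site is hypothetical:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def discover(seed, fetch, max_pages=100):
    """Breadth-first organic link discovery, restricted to the seed's host.

    `fetch(url)` returns HTML for a URL; a real crawler would issue
    HTTP requests, but here it is injected to keep the sketch offline."""
    host = urlparse(seed).netloc
    seen, frontier = {seed}, deque([seed])
    while frontier and len(seen) < max_pages:
        page = frontier.popleft()
        extractor = LinkExtractor()
        extractor.feed(fetch(page))
        for href in extractor.links:
            url = urljoin(page, href)  # resolve relative links
            if urlparse(url).netloc == host and url not in seen:
                seen.add(url)
                frontier.append(url)
    return seen

# Hypothetical three-page site used only to exercise the sketch;
# the off-host link is skipped by the same-host rule.
site = {
    "http://example.org/": '<a href="/a">a</a> <a href="http://other.net/">x</a>',
    "http://example.org/a": '<a href="/b">b</a>',
    "http://example.org/b": '<a href="/">home</a>',
}
found = discover("http://example.org/", lambda u: site.get(u, ""))
```

Broad surveys relax the same-host rule, while selective harvests add scope filters; the traversal itself stays the same.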

Typical Challenges of Archiving the Web

• Harvests are at best samples
– Time & expense: can’t get everything
– Rate of change: don’t get every version
– Rate of collection: issues of ‘time skew’

• User agents/ Protocols


Typical Challenges, cont.

• Publisher right to opt “in” or “out”
– Content behind log-ins cannot be archived w/o credentials
– Content can be blocked by robots.txt files (which our crawlers respect by default)

• The structure of sites/URLs makes it very hard to capture only the content of interest; each site has its own unique set of challenges.
– Some parts of sites are not “archive-friendly” (e.g. complex JavaScript, Flash, etc.)
– These sites tend to change both their technical structure and policy quickly and often.
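Respecting robots.txt by default can be illustrated with Python's standard-library parser. The rules below are a made-up example; the user-agent string is only illustrative:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt blocking one directory for all agents
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler consults the parser before each request
allowed = rp.can_fetch("example-crawler", "http://example.org/index.html")
blocked = rp.can_fetch("example-crawler", "http://example.org/private/x.html")
```

In practice a crawler fetches each host's `/robots.txt` (e.g. via `RobotFileParser.set_url` and `read`) and caches the parsed rules per host.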

Challenges, cont.

Social networks and collaborative/semi-private spaces

Immersive Worlds

~70% of the world’s digital content is now generated by individuals

Source: UK Telegraph, IDC annual survey, released May 2010

Web QA & Analysis

Daunting scale requires a multi-layered approach
– Automated QA to identify missing files used to render pages and prioritize URIs for harvest
– Filtering of spam and content farms discovered during harvest and post-harvest
– Randomized, representative, human critique of “in” vs. “out” of scope per a given legal mandate
– Advanced analyses: Web and link graphing, text mining
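The first QA step above, finding render dependencies that were never captured, can be sketched by diffing a page's embedded-resource URLs against the set of archived URLs. This is an illustration under simplifying assumptions: the archived set would really come from a CDX index, and the page and URLs are hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ResourceExtractor(HTMLParser):
    """Collect URLs of embedded resources a page needs to render."""
    TAG_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.TAG_ATTRS.get(tag)
        if wanted:
            for name, value in attrs:
                if name == wanted and value:
                    self.resources.append(value)

def missing_resources(page_url, html, archived):
    """Return embedded-resource URLs not present in the archive.

    `archived` is the set of already-captured URLs; in production it
    would be looked up in a CDX index rather than an in-memory set."""
    extractor = ResourceExtractor()
    extractor.feed(html)
    needed = {urljoin(page_url, r) for r in extractor.resources}
    return sorted(needed - archived)

html = '<img src="/logo.png"><script src="/app.js"></script>'
archived = {"http://example.org/logo.png"}
todo = missing_resources("http://example.org/", html, archived)
# `todo` holds the render dependencies to prioritize for re-harvest
```

At scale this check runs over billions of captures, which is why it must be automated rather than left to human review alone.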

Key Challenges

• Not all data can be crawled, need diverse methods of data collection

• Data may be lost no matter how carefully it is managed – Need to keep multiple, distributed copies!

• Harvested data can be hard to make accessible in a compelling way, on an ongoing basis, at *every* scale

• Research and experimentation are essential to keep pace with publisher innovation; partnerships are the only way to “keep up” & to support the demands of ongoing operations
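Keeping multiple distributed copies only guards against loss if their fixity is checked. A minimal sketch of such a check, comparing SHA-256 digests across replicas (the storage locations and record bytes are made up for illustration):

```python
import hashlib

def digest(payload):
    """SHA-256 fixity digest of a stored record's bytes."""
    return hashlib.sha256(payload).hexdigest()

def verify_copies(copies):
    """Compare digests across replicas; return locations that disagree.

    `copies` maps a storage location to the bytes it holds for one
    record. The first copy is used as the reference digest."""
    reference = digest(next(iter(copies.values())))
    return [loc for loc, data in copies.items() if digest(data) != reference]

# Hypothetical replicas of one archived record, one of them damaged
copies = {
    "datacenter-a": b"WARC/1.0 ...",
    "datacenter-b": b"WARC/1.0 ...",
    "datacenter-c": b"WARC/1.0 corrupted",
}
bad = verify_copies(copies)
```

Taking the first copy as the reference is a simplification; a production audit would vote across a majority of replicas (or check against a digest recorded at capture time) so that a corrupted first copy cannot masquerade as the good one.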

Key Challenges, cont.

• Manageable Costs/Sustainable Approaches
– Access to power & other critical operational resources
– Sufficient processing capacity for collection, analysis, discovery, & dissemination of resources
– Support for on-demand assembly of collections from aggregate data sets
– Timeliness of collection & access

• Intuitive interfaces for discovering & navigating resources over time, including robust APIs

• Recruitment of engineering talent

• Funding

Thank You!

Kris Carpenter Negulescu

Director, Web Group

Internet Archive

kcarpenter [at] archive [dot] org
