![Page 1: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/1.jpg)
Lazy Preservation, Warrick, and the Web Infrastructure
Frank McCown
Old Dominion University
Computer Science Department
Norfolk, Virginia, USA
JCDL 2007
Vancouver, BC
June 19, 2007
![Page 2: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/2.jpg)
2
Outline
• What is the Web Infrastructure (WI)?
• How can the WI be used for preservation?
• Web-repository crawling with Warrick
• Understanding the WI
– Caching experiment
– Reconstruction experiments
– Search engine sampling and IA overlap experiment
• Recovering web server components from the WI
• Brass: Queueing manager for Warrick
![Page 3: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/3.jpg)
3
![Page 4: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/4.jpg)
4
Web Infrastructure
![Page 5: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/5.jpg)
5
Alternative Models of Preservation
• Lazy Preservation
– Let Google, IA et al. preserve your website
• Just-In-Time Preservation
– Wait for it to disappear first, then recover a “good enough” version
• Shared Infrastructure Preservation
– Push your content to sites that might preserve it
• Web Server Enhanced Preservation
– Use Apache modules to create archival-ready resources
![Page 6: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/6.jpg)
6
![Page 7: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/7.jpg)
7
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
![Page 8: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/8.jpg)
8
Crawling the Crawlers
[Figure: web crawling (World Wide Web into a repository, Repo) contrasted with web-repository crawling (repositories Repo1, Repo2, ..., Repon)]
![Page 9: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/9.jpg)
9
![Page 10: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/10.jpg)
10
![Page 11: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/11.jpg)
11
Cached Image
![Page 12: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/12.jpg)
Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
MSN version Yahoo version Google version
canonical
![Page 13: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/13.jpg)
13
Web-repository Crawler
![Page 14: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/14.jpg)
14
• McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
• McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.
• McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.
• McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
Available at http://warrick.cs.odu.edu/
![Page 15: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/15.jpg)
15
What Types of Websites Are Lost?
Marshall, McCown, and Nelson, Evaluating Personal Archiving Strategies for Internet-based Information, IS&T Archiving 2007.
![Page 16: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/16.jpg)
16
Outline
• What is the Web Infrastructure (WI)?
• How can the WI be used for preservation?
• Web-repository crawling with Warrick
• Understanding the WI
– Caching experiment
– Reconstruction experiments
– Search engine sampling and IA overlap experiment
• Recovering web server components from the WI
• Brass: Queueing manager for Warrick
![Page 17: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/17.jpg)
17
Understanding the WI
• How quickly do search engines acquire and purge their caches?
• Do search engines prefer caching one type of resource over another?
• How much overlap is there between the search engine caches and IA holdings?
• How successfully can we reconstruct a lost website?
• Are some resources more recoverable than others?
![Page 18: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/18.jpg)
18
Timeline of Web Resource
![Page 19: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/19.jpg)
19
Web Caching Experiment
• Create 4 websites composed of HTML, PDFs, and images
– http://www.owenbrau.com/
– http://www.cs.odu.edu/~fmccown/lazy/
– http://www.cs.odu.edu/~jsmit/
– http://www.cs.odu.edu/~mln/lazp/
• Remove pages each day
• Query Google, MSN, and Yahoo (GMY) every day using identifiers
McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
![Page 20: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/20.jpg)
20
![Page 21: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/21.jpg)
21
![Page 22: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/22.jpg)
22
![Page 23: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/23.jpg)
23
![Page 24: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/24.jpg)
24
Where is the Internet Archive?
• No crawls from Alexa, IA’s provider
• Even if they had crawled us, the content would not be accessible from IA for 6-12 months
• Short-lived web content is likely to be lost for good
![Page 25: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/25.jpg)
25
2005 Reconstruction Experiment
• Crawl and reconstruct 24 sites of various sizes:
1. small (1-150 resources)
2. medium (151-499 resources)
3. large (500+ resources)
• Perform 5 reconstructions for each website
– One using all four repositories together
– Four using each repository separately
• Calculate reconstruction vector for each reconstruction (changed%, missing%, added%)
![Page 26: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/26.jpg)
26
How Much Did We Reconstruct?
[Figure: the “lost” web site (resources A; B, C; D, E, F) beside the reconstructed web site (A; B’, C’; G, E; F). Annotations: “Missing link to D; points to old resource G” and “F can’t be found”.]
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
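The four categories can be turned into the reconstruction vector (changed%, missing%, added%) with a short sketch like the one below. The slides do not spell out the denominators, so this assumes that changed and missing are measured against the size of the lost site and added against the size of the reconstructed site. The function name `reconstruction_vector` and the placeholder content strings are hypothetical.

```python
def reconstruction_vector(lost, reconstructed):
    """Return (changed, missing, added) fractions for two {name: content} maps."""
    # Changed: recovered under the same name but with different content.
    changed = [n for n in lost if n in reconstructed and reconstructed[n] != lost[n]]
    # Missing: in the lost site but not recovered at all.
    missing = [n for n in lost if n not in reconstructed]
    # Added: recovered resources that were not part of the lost site.
    added = [n for n in reconstructed if n not in lost]
    # Assumption: changed and missing are normalized by the lost site's size,
    # added by the reconstructed site's size.
    return (
        len(changed) / len(lost),
        len(missing) / len(lost),
        len(added) / len(reconstructed),
    )


# The example from this slide: A and E identical, B and C changed,
# D and F missing, G added.
lost = {"A": "a", "B": "b", "C": "c", "D": "d", "E": "e", "F": "f"}
reconstructed = {"A": "a", "B": "b'", "C": "c'", "G": "g", "E": "e"}

vector = reconstruction_vector(lost, reconstructed)
print(tuple(round(v, 2) for v in vector))  # (0.33, 0.33, 0.2)
```

Under these assumptions the example site yields 2 of 6 resources changed, 2 of 6 missing, and 1 of 5 recovered resources added.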
![Page 27: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/27.jpg)
27
Reconstruction Diagram
added 20%
identical 50%
changed 33%
missing 17%
![Page 28: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/28.jpg)
28
Recovery Success by MIME Type
![Page 29: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/29.jpg)
29
Repository Contributions
![Page 30: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/30.jpg)
30
2006 Reconstruction Experiment
• 300 websites chosen randomly from Open Directory Project (dmoz.org)
• Crawled and reconstructed each website every week for 14 weeks
• Examined change rates, age, decay, growth, recoverability
McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.
![Page 31: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/31.jpg)
31
Success of website recovery each week
*On average, we recovered 61% of a website in any given week.
![Page 32: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/32.jpg)
32
![Page 33: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/33.jpg)
33
Statistics for Repositories
![Page 34: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/34.jpg)
34
Experiment: Sample Search Engine Caches
• Feb 2006
• Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo
• Randomly selected 1 result from the first 100
• Downloaded the resource and its cached page
• Checked for overlap with the Internet Archive
McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
![Page 35: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/35.jpg)
35
Distribution of Top Level Domains
![Page 36: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/36.jpg)
36
Cached Resource Size Distributions
976 KB 977 KB
1 MB 215 KB
![Page 37: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/37.jpg)
37
Cache Freshness
[Timeline figure: the resource is crawled and cached (fresh), then changed on the web server (stale), then crawled and cached again (fresh)]
Staleness = max(0, Last-Modified HTTP header – cached date)
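The staleness formula can be sketched in a few lines of Python. This is an illustration only: the function name `staleness` and the sample dates are hypothetical, and it assumes the cached date is available as a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def staleness(last_modified_header: str, cached_at: datetime) -> timedelta:
    """Staleness = max(0, Last-Modified - cached date)."""
    # Parse the RFC-style date carried by the Last-Modified HTTP header.
    last_modified = parsedate_to_datetime(last_modified_header)
    # A cached copy made after the last modification is fresh (zero staleness).
    return max(timedelta(0), last_modified - cached_at)


# Hypothetical cached date for illustration.
cached = datetime(2006, 2, 10, tzinfo=timezone.utc)

# Resource changed after it was cached: the cached copy is stale.
stale = staleness("Wed, 15 Feb 2006 12:00:00 GMT", cached)
# Resource last changed before it was cached: the cached copy is fresh.
fresh = staleness("Mon, 06 Feb 2006 08:00:00 GMT", cached)

print(stale, fresh)
```

Clamping at zero means a copy cached after the last modification is simply reported as fresh rather than given a negative staleness.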
![Page 38: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/38.jpg)
38
Cache Staleness
• 46% of resources had a Last-Modified header
• 71% also had a cached date
• 16% were at least 1 day stale
![Page 39: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/39.jpg)
39
Similarity vs. Staleness
![Page 40: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/40.jpg)
40
How much of the Web is indexed?
Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)
Yahoo
MSN
Indexable Web
8 billion pages
6.6 billion pages
5 billion pages
11.5 billion pages
Internet Archive?
![Page 41: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/41.jpg)
41
Overlap with Internet Archive
![Page 42: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/42.jpg)
42
Overlap with Internet Archive
![Page 43: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/43.jpg)
43
Distribution of Sampled URLs
![Page 44: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/44.jpg)
44
Problem:
The WI currently stores only the client-side representation of a website. Server components (scripts, databases, configuration files, etc.) are not accessible from the WI.
![Page 45: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/45.jpg)
45
Outline
• What is the Web Infrastructure (WI)?
• How can the WI be used for preservation?
• Web-repository crawling with Warrick
• Understanding the WI
– Caching experiment
– Reconstruction experiments
– Search engine sampling and IA overlap experiment
• Recovering web server components from the WI
• Brass: Queueing manager for Warrick
![Page 46: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/46.jpg)
46
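[Figure: a web server holding a database, Perl script, config, and static files (html files, PDFs, images, style sheets, Javascript, etc.), which produces dynamic pages. Content that reaches the Web Infrastructure is labeled “Recoverable”; the server-side components are labeled “Not Recoverable”.]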
![Page 47: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/47.jpg)
47
Injecting Server Components into Crawlable Pages
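[Figure: server components are encoded with erasure codes and the resulting blocks are injected into HTML pages; recovering at least m blocks is enough to rebuild the components]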
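The "recover at least m blocks" property can be illustrated with the simplest erasure code, a single XOR parity block, where any m of the m + 1 blocks rebuild the original bytes. The slides do not say which erasure code is used, so this is a stand-in sketch rather than the scheme from the talk; `encode`, `decode`, and the sample script are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, m: int) -> list:
    """Split data into m equal blocks and append one XOR parity block."""
    block_len = -(-len(data) // m)  # ceiling division
    padded = data.ljust(block_len * m, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(m)]
    return blocks + [reduce(xor_bytes, blocks)]


def decode(blocks: dict, m: int, length: int) -> bytes:
    """Rebuild the data from any m of the m + 1 blocks, keyed by block index."""
    missing = [i for i in range(m) if i not in blocks]
    if len(missing) > 1 or (missing and m not in blocks):
        raise ValueError("need at least m blocks to recover")
    blocks = dict(blocks)
    if missing:
        # XOR of the surviving data blocks and the parity block
        # yields the one lost data block.
        blocks[missing[0]] = reduce(xor_bytes, blocks.values())
    return b"".join(blocks[i] for i in range(m))[:length]


# Hypothetical server-side script to protect.
script = b"#!/usr/bin/perl\nprint 'hello';\n"
encoded = encode(script, 4)

# Simulate one block never being crawled and cached.
survivors = {i: b for i, b in enumerate(encoded) if i != 2}
recovered = decode(survivors, 4, len(script))
print(recovered == script)  # True
```

A single parity block tolerates only one lost block; codes that survive more losses follow the same any-m-blocks idea with more redundancy.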
![Page 48: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/48.jpg)
48
Brass: A Queueing Manager for Warrick
• Warrick requires some technical expertise to download, install, and run
• Warrick uses search engine APIs which allow limited requests per IP address (or key)
• Google no longer provides new keys for accessing their API
![Page 49: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/49.jpg)
49
![Page 50: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/50.jpg)
50
![Page 51: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/51.jpg)
51
Thank You
Frank McCown
[email protected]
http://www.cs.odu.edu/~fmccown/
Can’t wait until I’m old enough to recover a website!
![Page 55: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/55.jpg)
56
Web Repository Characteristics

| Type | MIME type | File ext | Google | Yahoo | Live | IA |
|---|---|---|---|---|---|---|
| HTML text | text/html | html | C | C | C | C |
| Plain text | text/plain | txt, ans | M | M | M | C |
| Graphic Interchange Format | image/gif | gif | M | M | M | C |
| Joint Photographic Experts Group | image/jpeg | jpg | M | M | M | C |
| Portable Network Graphic | image/png | png | M | M | M | C |
| Adobe Portable Document Format | application/pdf | pdf | M | M | M | C |
| JavaScript | application/javascript | js | M | M | | C |
| Microsoft Excel | application/vnd.ms-excel | xls | M | ~S | M | C |
| Microsoft PowerPoint | application/vnd.ms-powerpoint | ppt | M | M | M | C |
| Microsoft Word | application/msword | doc | M | M | M | C |
| PostScript | application/postscript | ps | M | ~S | | C |

C: Canonical version is stored
M: Modified version is stored (modified images are thumbnails, all others are html conversions)
~S: Indexed but not stored
![Page 56: Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007](https://reader035.vdocuments.net/reader035/viewer/2022062305/5697bfb51a28abf838c9d799/html5/thumbnails/56.jpg)
57
Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.