Lazy Preservation: Reconstructing Websites by Crawling the Crawlers Frank McCown, Joan A. Smith, Michael L. Nelson, & Johan Bollen Old Dominion University Norfolk, Virginia, USA Arlington, Virginia November 10, 2006 WIDM 2006

Page 1:

Lazy Preservation: Reconstructing Websites by Crawling the Crawlers

Frank McCown, Joan A. Smith, Michael L. Nelson, & Johan Bollen

Old Dominion University
Norfolk, Virginia, USA

Arlington, Virginia
November 10, 2006
WIDM 2006

Page 2:

Outline

• Web page threats
• Web Infrastructure
• Web caching experiment
• Web repository crawling
• Website reconstruction experiment

Page 3:

Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

Page 4:

How much of the Web is indexed?

Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)

• Google: 8 billion pages
• Yahoo: 6.6 billion pages
• MSN: 5 billion pages
• Indexable Web: 11.5 billion pages

Page 7:

Cached Image

Page 8:

Cached PDF

http://www.fda.gov/cder/about/whatwedo/testtube.pdf

MSN version Yahoo version Google version

canonical

Page 9:

Web Repository Characteristics

Type                              MIME type                      File ext  Google  Yahoo  MSN  IA
HTML text                         text/html                      html      C       C      C    C
Plain text                        text/plain                     txt, ans  M       M      M    C
Graphic Interchange Format        image/gif                      gif       M       M      ~R   C
Joint Photographic Experts Group  image/jpeg                     jpg       M       M      ~R   C
Portable Network Graphic          image/png                      png       M       M      ~R   C
Adobe Portable Document Format    application/pdf                pdf       M       M      M    C
JavaScript                        application/javascript         js        M       M           C
Microsoft Excel                   application/vnd.ms-excel       xls       M       ~S     M    C
Microsoft PowerPoint              application/vnd.ms-powerpoint  ppt       M       M      M    C
Microsoft Word                    application/msword             doc       M       M      M    C
PostScript                        application/postscript         ps        M       ~S          C

C   Canonical version is stored
M   Modified version is stored (modified images are thumbnails; all others are HTML conversions)
~R  Indexed but not retrievable
~S  Indexed but not stored
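The table suggests a simple preference rule when several repositories hold the same resource: a canonical copy (C) is more useful than a modified one (M), and ~R or ~S entries cannot be retrieved at all. A minimal Python sketch of that rule, using a few rows from the table. The function name `best_repositories` and the ranking are illustrative assumptions, not Warrick's documented behavior.

```python
# Storage codes copied from a few rows of the table above.
STORAGE = {
    "text/html": {"Google": "C", "Yahoo": "C", "MSN": "C", "IA": "C"},
    "image/gif": {"Google": "M", "Yahoo": "M", "MSN": "~R", "IA": "C"},
    "application/pdf": {"Google": "M", "Yahoo": "M", "MSN": "M", "IA": "C"},
    "application/vnd.ms-excel": {"Google": "M", "Yahoo": "~S", "MSN": "M", "IA": "C"},
}

# Assumed ranking: canonical beats modified; ~R and ~S are unusable.
RANK = {"C": 0, "M": 1}

def best_repositories(mime_type):
    """Return repositories holding a usable copy, canonical ones first."""
    codes = STORAGE.get(mime_type, {})
    usable = [(RANK[code], repo) for repo, code in codes.items() if code in RANK]
    # Sort by rank, then by repository name for a stable order.
    return [repo for _, repo in sorted(usable)]

print(best_repositories("image/gif"))
```

For GIF images this puts IA first, because it is the only repository in the table that stores the canonical file, and drops MSN, whose copy is indexed but not retrievable.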

Page 10:

Timeline of Web Resource

Page 11:

Web Caching Experiment

• Create 4 websites composed of HTML, PDF, images
  – http://www.owenbrau.com/
  – http://www.cs.odu.edu/~fmccown/lazy/
  – http://www.cs.odu.edu/~jsmit/
  – http://www.cs.odu.edu/~mln/lazp/

• Remove pages each day

• Query GMY (Google, MSN, Yahoo) each day using identifiers

Page 16:

Crawling the Web and web repositories

Figure labels: World Wide Web (web crawling); Repo 1, Repo 2, ..., Repo n (web-repository crawling)
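Web-repository crawling, which this slide contrasts with ordinary web crawling, walks a site's link graph by asking repositories for stored copies instead of asking the live web server. A minimal sketch, assuming each repository is a plain dict from URL to stored HTML (a hypothetical stand-in for real repository queries) and that the first repository holding a URL wins. The slides do not describe Warrick's actual selection rules, so this is only an illustration.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Simplistic link extraction; a real crawler would use an HTML parser.
HREF = re.compile(r'href="([^"]+)"')

def crawl_repositories(start_url, repos):
    """Breadth-first recovery of one site from a list of repository dicts."""
    site = urlparse(start_url).netloc
    frontier = deque([start_url])
    seen = {start_url}
    recovered, missing = {}, []
    while frontier:
        url = frontier.popleft()
        # Ask each repository in turn; take the first stored copy found.
        copy = next((repo[url] for repo in repos if url in repo), None)
        if copy is None:
            missing.append(url)
            continue
        recovered[url] = copy
        # Follow links in the recovered copy, staying on the same site.
        for href in HREF.findall(copy):
            link = urljoin(url, href)
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                frontier.append(link)
    return recovered, missing

# Hypothetical repository contents for a made-up site.
repo1 = {"http://example.org/": '<a href="a.html">a</a> <a href="b.html">b</a>'}
repo2 = {"http://example.org/a.html": '<a href="http://other.org/x">x</a>'}
recovered, missing = crawl_repositories("http://example.org/", [repo1, repo2])
print(sorted(recovered), missing)
```

The frontier only grows from links found in recovered copies, so a page that no repository holds also hides every page reachable only through it.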

Page 17:

• First developed in fall of 2005
• Available for download at http://www.cs.odu.edu/~fmccown/warrick/
• www2006.org – first lost website reconstructed (Nov 2005)
• DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
• www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
• Internet Archive officially endorses Warrick (mid Mar 2006)

Page 18:

How Much Did We Reconstruct?

Figure: “Lost” web site (nodes A, B, C, D, E, F) beside the reconstructed web site (nodes A, B’, C’, G, E, F). Annotations: “Missing link to D; points to old resource G” and “F can’t be found”.

Four categories of recovered resources:

1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
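The four categories can be computed by comparing the lost site with its reconstruction. A sketch, assuming each site is a dict from resource name to content and that "changed" means the same name with different content. The content strings are placeholders, and the inputs follow the category list on this slide.

```python
def categorize(original, reconstructed):
    """Split resources into identical, changed, missing, and added."""
    result = {"identical": [], "changed": [], "missing": [], "added": []}
    for name, content in original.items():
        if name not in reconstructed:
            result["missing"].append(name)
        elif reconstructed[name] == content:
            result["identical"].append(name)
        else:
            result["changed"].append(name)
    # Anything recovered that the original site never had is "added".
    result["added"] = [name for name in reconstructed if name not in original]
    return result

# Placeholder contents; the letters follow the slide's example.
lost = {"A": "a", "B": "b", "C": "c", "D": "d", "E": "e", "F": "f"}
rebuilt = {"A": "a", "B": "b'", "C": "c'", "G": "g", "E": "e"}
print(categorize(lost, rebuilt))
```

Comparing raw content for equality is the strictest possible test. A real comparison might need to tolerate repository-side changes such as the HTML conversions noted in the repository table.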

Page 19:

Reconstruction Diagram

added 20%

identical 50%

changed 33%

missing 17%

Page 20:

Reconstruction Experiment

• Crawl and reconstruct 24 sites of various sizes:

  1. small (1-150 resources)
  2. medium (151-499 resources)
  3. large (500+ resources)

• Perform 5 reconstructions for each website
  – One using all four repositories together
  – Four using each repository separately

• Calculate reconstruction vector for each reconstruction (changed%, missing%, added%)
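A reconstruction vector can be derived from the four category counts. The sketch below assumes that changed and missing are measured against the original site's resource count and added against the reconstructed site's resource count. That choice of denominators is an assumption, though it is consistent with the Reconstruction Diagram, where identical, changed, and missing sum to 100% and added is reported separately. The counts are hypothetical, not data from the experiment.

```python
def reconstruction_vector(identical, changed, missing, added):
    """Return (changed%, missing%, added%) from resource counts."""
    original_total = identical + changed + missing    # resources on the lost site
    recovered_total = identical + changed + added     # resources in the reconstruction
    return (
        100 * changed / original_total,
        100 * missing / original_total,
        100 * added / recovered_total,
    )

# Hypothetical counts, chosen only for illustration.
vector = reconstruction_vector(identical=12, changed=8, missing=4, added=5)
print(tuple(round(v) for v in vector))
```

A vector of (0, 0, 0) would describe a perfect reconstruction. The function does not guard against empty sites, where either total would be zero.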

Page 21:

Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.

Page 22:

Recovery Success by MIME Type

Page 23:

Repository Contributions

Page 24:

Current & Future Work

• Building a web interface for Warrick

• Currently crawling & reconstructing 300 randomly sampled websites each week
  – Move from descriptive model to prescriptive & predictive model

• Injecting server-side functionality into the Web Infrastructure (WI)
  – Recover the PHP code, not just the HTML