WebRecorder.io
Building a new archiving service for everyone!
What is WebRecorder.io?
On-Demand Archiving through the browser.
What you see is what you archive (WYSIWYA)
Available to anyone!
“Quality over Quantity” - High-Fidelity Replay of Web Content
Current Service
Proof-of-Concept
Users can record a page and browse
Users can download the WARC after browsing
Users can upload a WARC and replay it
No content stored, WARCs deleted after 30 mins.
Created last year as an experiment
You can use it at: https://webrecorder.io
New WebRecorder.io Service
First new version up at beta.webrecorder.io for demo
Initially invite-only to monitor capacity
User registration, login, individual collections
Collections available at beta.webrecorder.io/<user>/<coll>
Collections can be private, public, or shared privately (coming soon).
Live Demo!
Privacy Concerns
User responsible for their own archive, has full control
Collections private by default, but users may choose what to make public
For now, WARCs downloadable only by owner, though may change.
May have additional access levels: share read-only, share for recording, etc...
Cookies: Cookies are recorded, but not replayed
Looking for ideas/better ways to address privacy. Suggestions welcome!
Goals/Features
Provide a flexible archiving service for high-fidelity web archiving.
Customizable UI, metadata and annotation support.
On-Demand Full-Text Search.
Multiple privacy options, custom sharing settings.
Multiple backends for storage.
A version that can also be hosted on custom hardware, not in “the cloud”
Tools Used in WebRecorder.io
Built with open-source tools
pywb (python wayback) - https://github.com/ikreymer/pywb
Embedded in the web app as the front-end web service; handles URL rewriting with custom rules, WARC reading, and live web fetching.
warcprox - https://github.com/internetarchive/warcprox
Created by Noah Levitt of IA; an HTTP/S proxy which records HTTP traffic to WARCs.
Help Wanted!
Looking for collaborators, developers, UI designers, archivists
If you ever wanted to participate in building an archiving service, here is your chance.
Sign up for the mailing list on webrecorder.io or request an invite at beta.webrecorder.io
You can also email [email protected]
Symmetrical Archiving: server- and client-side URL rewriting for record and replay follow the same path
Easy part: HTML URL rewriting
Hard part: JavaScript
Attempt to emulate the original JS environment as much as possible, with customizable client-side hooks
Far from foolproof; Flash and Java applets are still problematic.
Addendum: How It Works
“Symmetrical Archiving”
User browses page through /record/ path → Page is recorded to WARC and indexed
User browses page through /replay/ path → Page is replayed from WARC using index
Attempt symmetry in capture and replay as much as possible.
Assumption: Dynamic content generated for /record/ = Dynamic content generated for /replay/
“Symmetrical Archiving”
/<coll>/record/ path ↔ URL rewriting system ↔ fetch HTTP data ↔ recording proxy writes WARCs ↔ live web
/<coll>/ path ↔ URL rewriting system ↔ fetch HTTP data ↔ read from WARC
Attempt symmetry in capture and replay as much as possible. Recorded content is instantly replayable.
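The two paths above can be sketched as a toy dispatcher. This is only an illustration of the idea, not how WebRecorder.io or pywb is implemented: the class name SymmetricArchive, the in-memory dict standing in for the WARC and its index, and the fetch_live callback are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class SymmetricArchive:
    """Toy stand-in for a WARC file plus its index (hypothetical, in memory)."""

    def __init__(self, fetch_live: Callable[[str], bytes]) -> None:
        # fetch_live is an assumed callback that fetches a URL from the live web.
        self._fetch_live = fetch_live
        self._records: Dict[Tuple[str, str], bytes] = {}

    def handle(self, path: str) -> bytes:
        # Expected shapes: "/<coll>/record/<url>" or "/<coll>/<url>".
        coll, _, rest = path.lstrip("/").partition("/")
        if rest.startswith("record/"):
            url = rest[len("record/"):]
            body = self._fetch_live(url)
            # Recording and indexing happen in the same step,
            # so the capture is replayable immediately.
            self._records[(coll, url)] = body
            return body
        # Replay path: serve only what was recorded, never the live web.
        return self._records[(coll, rest)]
```

Both paths go through the same handle() entry point, which is the symmetry the slide describes; only the data source differs. A replay request for a URL that was never recorded raises KeyError in this sketch.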
URL rewriting is the hard part!
Actually more like “emulating original page context” when running through a proxy/recording.
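As a rough illustration of the "easy part", a minimal sketch of server-side HTML URL rewriting follows. It is not pywb's rewriter: the function name rewrite_html, the regex, and the "/demo/" prefix are assumptions made for the example, and a real rewriter would use an HTML parser and cover many more attributes.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." (either quote style); skips data-src and similar.
_ATTR_RE = re.compile(r'(?<![\w-])(href|src)=(["\'])(.*?)\2', re.IGNORECASE)

# Values that should never be routed through the archive.
_SKIP_PREFIXES = ("#", "data:", "javascript:", "mailto:")


def rewrite_html(html: str, page_url: str, prefix: str) -> str:
    """Resolve each link against page_url and route it through prefix."""

    def repl(match: "re.Match[str]") -> str:
        attr, quote, value = match.groups()
        if value.startswith(_SKIP_PREFIXES):
            return match.group(0)
        absolute = urljoin(page_url, value)
        return f"{attr}={quote}{prefix}{absolute}{quote}"

    return _ATTR_RE.sub(repl, html)
```

For example, with page_url "http://example.com/dir/page.html" and prefix "/demo/", a link to "/about" becomes "/demo/http://example.com/about", so following it stays inside the collection.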
“When symmetry breaks”
URLs change based on timestamp or date, e.g. ?_=<timestamp>
Possible Solution: Override Date(), server-side “fuzzy matching” ignoring certain query params
Flash video in a custom Flash SWF
Possible Solution: may be able to force HTML5, otherwise youtube-dl may download the Flash version, and replace with a custom player (FlowPlayer)
General black-box Flash content with hard-coded links
Possible Solution: No good one so far! Maybe shumway.js, a JavaScript Flash player from Mozilla?
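The server-side "fuzzy matching" idea can be sketched as a lookup key that drops volatile query parameters. This is a minimal sketch under assumptions, not pywb's actual rule format: the function name fuzzy_key and the contents of IGNORED_PARAMS are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of cache-busting params to ignore when matching.
IGNORED_PARAMS = frozenset({"_", "timestamp"})


def fuzzy_key(url: str, ignored: frozenset = IGNORED_PARAMS) -> str:
    """Build a lookup key that is stable across timestamp-only differences."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    kept.sort()  # parameter order should not affect the match
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )
```

If both the recorded URL and the replay request are indexed by fuzzy_key(), a request made with ?_=<new timestamp> at replay time still finds the capture made with ?_=<old timestamp>.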
“When symmetry breaks”
JavaScript-generated content “leaks” to live web
Possible Solution: Extensive client-side URL rewriting
Checks for window.location or window.top
Possible Solution: Rewrite window.location → WB_wombat_location, window.top → WB_wombat_top
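The rename above could be applied server-side to JavaScript source with a regex, roughly as sketched here. This is an assumption-laden illustration, not pywb's rewriter: the function name rewrite_js is hypothetical, and the sketch keeps the window. receiver and renames only the property, which is one possible reading of the slide.

```python
import re

# Matches window.location / window.top as whole property names only.
_WINDOW_PROP_RE = re.compile(r"\bwindow\.(location|top)\b")


def rewrite_js(source: str) -> str:
    """Point window.location / window.top at the WB_wombat_* stand-ins."""
    return _WINDOW_PROP_RE.sub(r"window.WB_wombat_\1", source)
```

A regex like this misses aliased access such as var w = window; w.top, which is part of why this approach is far from foolproof and client-side overrides are also needed.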
wombat.js rewriting library
The following are some of the possible overrides by wombat.js:
AJAX (XMLHttpRequest.open)
window.open
History.pushState / replaceState
Object.defineProperty() overrides on: document.domain, document.cookie
WB_wombat_location emulates window.location with rewriting (with server-side rewriting)
WB_wombat_top emulates window.top but hides container frame (with server-side rewriting)
Window postMessage()
Date() constructor
Seed Math.random with capture time
document.write()
setAttribute() / or mutation observers
appendChild() / replaceChild() / insertChild()
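The reason for seeding Math.random with the capture time can be shown with a Python analogue. This is not wombat.js code; replay_rng is a hypothetical helper, and Python's random module stands in for the browser's generator. Seeding with the capture time makes every replay of the same capture see the same "random" sequence, so URLs built from random values can match what was recorded.

```python
import random
from datetime import datetime


def replay_rng(capture_time: datetime) -> random.Random:
    """Return a generator whose output depends only on the capture time."""
    seed = int(capture_time.timestamp())
    return random.Random(seed)
```

Two replay sessions created from the same capture_time yield identical sequences, while a different capture time yields a different one.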
pywb
wombat.js is part of pywb, a new open source python “wayback machine” implementation
Optional custom rules can be specified for any site by prefix or regex in a YAML file.
Fuzzy matching rules: specify significant query params
No config file required!
Out-of-the-box simple collection management tools for running an archive
More details at: https://github.com/ikreymer/pywb
Future updates will include improvements to rule customization.