Aleph Archives IIPC Presentation
TRANSCRIPT
WEB ARCHIVING BUCKET
IIPC - Crawl Meeting
Aleph Archives
Toronto, October 3rd, 2012
Marco Roy
Project & Channel [email protected]
1Tuesday, October 2, 12
Web Archiving Bucket
A set of tools to simplify Web Archiving
© ALEPH ARCHIVES 2012
RELEASES TIMELINE
Sep. 5th, 2012: WSDK 1.0.0
Dec. 24th, 2012: Cobalt 1.0.0
Software characteristics
User-Friendly
Up-To-Date
Highly Optimized
No Compromise
Cross-Platform (Windows, Linux, etc.): 32/64-bit
Adapted from Aleph’s Production Code
WARC Software Development Kit (WSDK)
API for building Web Archiving software
WSDK
Key Benefits
Time and Resource saving
Ready for Cloud development
Build Highly Scalable software
WSDK
Technical Specifications
Multi-Core aware
Robust Networking Stack
Terseness (~300 KB): fast lexers, parsers, etc.
Carefully designed algorithms

Language   Modules   LoC
Erlang     17        2832
C          2         623
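
The slides mention fast WARC lexers and parsers but do not show the WSDK API. As a rough illustration of what such a parser has to handle, here is a minimal Python sketch of reading records from an uncompressed WARC byte stream. The function name `read_warc_records` and the `SAMPLE` data are made up for this example and are not part of WSDK; the record layout (a `WARC/` version line, named header fields, a blank line, then `Content-Length` bytes of payload) follows the WARC file format.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration only; not the WSDK API.
    """
    while True:
        version = stream.readline()
        # Skip the blank lines that separate consecutive records.
        while version in (b"\r\n", b"\n"):
            version = stream.readline()
        if not version:
            return  # end of stream
        if not version.startswith(b"WARC/"):
            raise ValueError(f"not a WARC record: {version!r}")

        # Header block: "Name: value" lines up to the first blank line.
        headers = {}
        for line in iter(stream.readline, b""):
            if line in (b"\r\n", b"\n"):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Two tiny hand-written records (assumed sample data, not from the slides).
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
    b"\r\n\r\n"
)

if __name__ == "__main__":
    for headers, payload in read_warc_records(io.BytesIO(SAMPLE)):
        print(headers["WARC-Type"], len(payload))
```

For a `.warc.gz` file, the same function should work on the object returned by `gzip.open(path, "rb")`, since Python's `gzip` module reads concatenated gzip members as one stream.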
Multi-Core Speed Test (Linux Server)

             1-Core       4-Core
Count Time   103.7 sec.   69.9 sec.

Test data: 4 WARCs from http://archive.org/details/testWARCfiles
WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz   970MB
WIDE-20110225184020081-04372-13730~crawl301.us.archive.org~9443.warc.gz   957MB
WIDE-20110225210142891-04382-13730~crawl301.us.archive.org~9443.warc.gz   956MB
WIDE-20110225221304846-04388-13730~crawl301.us.archive.org~9443.warc.gz   969MB

Test machine:
Architecture: x86_64
CPU op-mode: 32-bit, 64-bit
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 23
Stepping: 7
CPU MHz: 2499.876
BogoMIPS: 4999.75
Virtualisation: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K

[VIDEO]
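
To put the two timings in perspective, the following Python sketch works out the speedup and per-core efficiency from the figures on the slide. The throughput lines rest on an assumption the slide does not state: that each timing covers all four listed WARC files (3852 MB of compressed data in total).

```python
def speedup(t_single, t_parallel):
    """How many times faster the parallel run is than the single-core run."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, cores):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup(t_single, t_parallel) / cores


# Figures taken from the slide.
T1, T4, CORES = 103.7, 69.9, 4
SIZES_MB = [970, 957, 956, 969]  # compressed sizes of the four test WARCs

if __name__ == "__main__":
    print(f"speedup:    {speedup(T1, T4):.2f}x")            # 1.48x
    print(f"efficiency: {efficiency(T1, T4, CORES):.0%}")   # 37%

    # Assumption: each timing covers all four files.
    total_mb = sum(SIZES_MB)
    print(f"1-core throughput: {total_mb / T1:.1f} MB/s")   # 37.1 MB/s
    print(f"4-core throughput: {total_mb / T4:.1f} MB/s")   # 55.1 MB/s
```

A speedup of about 1.48x on 4 cores suggests the run is not purely CPU-bound; disk I/O or other serial work may limit scaling, although the slide gives no breakdown.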
WSDK
Next release
Remote WARC manipulation API (REST)
WARC Writing Proxy
COBALT
Web Archives Playback Cluster
[Diagram: cobalt load balancer in front of cluster nodes cobalt 01, cobalt 02, cobalt 03, ..., cobalt 08, cobalt 09, cobalt 10]
COBALT
Key Benefits
No Configuration
No Single Point of Failure (SPOF)
Fast Web Archives Access
COBALT
Technical Specifications
Clustered architecture
Automatic Resources Discovery
Playback Proxy by default
Modern and fast WARCs indexer
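
The slides do not describe how Cobalt's indexer works. As a general illustration of what a WARC index is for, here is a minimal Python sketch that maps each target URI to the capture dates and byte offsets of its records, so a playback tool can seek straight to a record instead of scanning the whole file. The function `build_index` and the sample data are hypothetical and are not Cobalt's implementation.

```python
import io
from collections import defaultdict


def build_index(stream):
    """Map WARC-Target-URI -> list of (WARC-Date, record offset).

    Hypothetical sketch for an uncompressed, seekable WARC stream;
    not Cobalt's actual indexer.
    """
    index = defaultdict(list)
    while True:
        offset = stream.tell()  # candidate start of the next record
        line = stream.readline()
        if not line:
            break  # end of stream
        if line in (b"\r\n", b"\n"):
            continue  # blank separator between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"not a WARC record at offset {offset}")

        # Read the header fields up to the blank line.
        headers = {}
        for header in iter(stream.readline, b""):
            if header in (b"\r\n", b"\n"):
                break
            name, _, value = header.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # Skip the payload without reading it into memory.
        stream.seek(int(headers["Content-Length"]), io.SEEK_CUR)

        uri = headers.get("WARC-Target-URI")
        if uri:
            index[uri].append((headers.get("WARC-Date", ""), offset))
    return dict(index)


def _record(date, body):
    """Build one sample response record (assumed test data)."""
    return (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        b"WARC-Date: " + date + b"\r\n"
        b"Content-Length: " + str(len(body)).encode() + b"\r\n"
        b"\r\n" + body + b"\r\n\r\n"
    )


SAMPLE = _record(b"2011-02-25T18:32:19Z", b"first") + _record(
    b"2011-02-25T18:40:20Z", b"second"
)

if __name__ == "__main__":
    for uri, captures in build_index(io.BytesIO(SAMPLE)).items():
        print(uri, captures)
```

For gzip-compressed WARCs, an index would normally store the offset of each record's gzip member in the compressed file rather than an offset into the decompressed stream; that detail is left out here.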
aleph-archives.com
☛ webarchivingbucket.com