Data Lessons Learned at Scale
DESCRIPTION
Data lessons learned at scale while building AddThis' data processing system.
TRANSCRIPT
![Page 1: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/1.jpg)
Charlie Reverte
VP Engineering
@numbakrrunch
Data Lessons Learned at Scale
![Page 2: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/2.jpg)
Topic
Half of the work that it takes to do data science is plumbing and wrangling
Here are some lessons we’ve learned...
![Page 3: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/3.jpg)
About AddThis
We make tools for websites
![Page 4: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/4.jpg)
Our Data
We process website data
● Visitation
● Sharing
● Following
● Content Classification
And use it to improve the site
● Content Recommendation
● Personalization
● Analytics
![Page 5: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/5.jpg)
At Scale...
● 14 million domains
● 100 billion views/month
● 50k events/sec
● 160k concurrent firewall sessions
● 500k unique ganglia metrics
![Page 6: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/6.jpg)
Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value
Hex: 4f6934b6f54bd7c1
Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
○ (m choose 2) / n = 0.142 collisions/sec (at m = 35k rq/sec, with n = 2^32 random values)
● Naturally time ordered, built-in DoB
Compare to Twitter Snowflake
https://github.com/twitter/snowflake/
[Diagram: 64-bit ID layout, time in bits 63-32, rand in bits 31-0]
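A minimal sketch of this ID scheme, assuming the time field is seconds since the Unix epoch in the upper 32 bits and the random field fills the lower 32 bits (read off the bit layout on the slide). The function names are hypothetical, and the real IDs are generated in the browser, not in Python.

```python
import math
import secrets
import time

RAND_BITS = 32  # assumption: random value occupies bits 31-0


def make_id(now=None):
    """Build a 64-bit ID: seconds since epoch in the high bits, random low bits."""
    seconds = int(time.time() if now is None else now) & 0xFFFFFFFF
    return (seconds << RAND_BITS) | secrets.randbits(RAND_BITS)


def id_time(id_value):
    """Recover the creation time (the built-in 'date of birth') from an ID."""
    return id_value >> RAND_BITS


def expected_collisions_per_sec(ids_per_sec, rand_bits=RAND_BITS):
    """Expected colliding pairs among IDs created in the same second."""
    return math.comb(ids_per_sec, 2) / 2 ** rand_bits


if __name__ == "__main__":
    new_id = make_id()
    print(format(new_id, "016x"), id_time(new_id))
    print(expected_collisions_per_sec(35_000))  # roughly 0.14
```

Because the timestamp sits in the most significant bits, sorting IDs numerically also sorts them by creation second; uniqueness only has to hold among IDs created within the same second.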
![Page 7: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/7.jpg)
Counting Things
● Cardinality
● Set membership
● Top-k elements
● Frequency

● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus
○ Distributed counting
○ Checkpointing
Stream-lib: https://github.com/clearspring/stream-lib
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
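To illustrate why mergeability matters for distributed counting, here is a minimal sketch of one mergeable cardinality estimator, a K-Minimum-Values sketch. It is not stream-lib's API (stream-lib is a Java library); the class and method names are hypothetical, and the choice of `k` is an assumption that trades memory for accuracy.

```python
import hashlib
import heapq


class KMVSketch:
    """Estimate distinct counts by tracking the k smallest hash values seen."""

    def __init__(self, k=1024):
        self.k = k
        self.values = set()  # always a superset of the k smallest hashes

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64

    def _trim(self):
        if len(self.values) > self.k:
            self.values = set(heapq.nsmallest(self.k, self.values))

    def add(self, item):
        self.values.add(self._hash(item))
        if len(self.values) > 2 * self.k:  # trim lazily to keep adds cheap
            self._trim()

    def merge(self, other):
        """Combine two sketches built on different nodes or time windows."""
        merged = KMVSketch(self.k)
        merged.values = self.values | other.values
        merged._trim()
        return merged

    def estimate(self):
        self._trim()
        if len(self.values) < self.k:
            return float(len(self.values))  # exact while under k distinct items
        return (self.k - 1) / max(self.values)


if __name__ == "__main__":
    node_a, node_b = KMVSketch(), KMVSketch()
    for i in range(0, 6000):
        node_a.add(f"user-{i}")
    for i in range(4000, 10000):
        node_b.add(f"user-{i}")
    print(round(node_a.merge(node_b).estimate()))  # close to 10000
```

Each node keeps a fixed-size sketch, and merging two sketches gives the same estimate as if one node had seen both streams, which is what makes distributed counting and checkpointing straightforward.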
![Page 8: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/8.jpg)
Joining Data
● Value of data increases with higher dimensionality
○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest
○ Disk is cheap
● Join your data in client-side storage
○ Browsers as a lossy distributed database
● Oceans of data in the cloud...
“The value is in the join” (or something like that)
https://github.com/stewartoallen
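Joining at ingest can be sketched as a small enrichment step that copies lookup data onto each event before it is stored. The field names and lookup tables below are hypothetical, chosen only to mirror the dimensions listed above.

```python
def enrich(event, geo_by_ip, page_attrs_by_url):
    """Return a de-normalized copy of the event with joined dimensions attached."""
    record = dict(event)
    record["geo"] = geo_by_ip.get(event.get("ip"), "unknown")
    # Flatten page attributes onto the record under a prefix to avoid key clashes.
    for key, value in page_attrs_by_url.get(event.get("url"), {}).items():
        record["page_" + key] = value
    return record


if __name__ == "__main__":
    geo = {"203.0.113.7": "US-VA"}
    pages = {"http://example.com/a": {"category": "sports"}}
    raw = {"ip": "203.0.113.7", "url": "http://example.com/a", "type": "view"}
    print(enrich(raw, geo, pages))
```

The stored record is larger, but later queries no longer need the lookup tables, which is the trade the "disk is cheap" bullet points at.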
![Page 9: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/9.jpg)
Sharding and Sampling
● Choose your shard keys wisely
○ High cardinality field to reduce lumpiness
○ What do you need to co-locate
○ Storage is cheap, multiple copies?
● Shards also useful for sampling
○ Complete data subsets
● Can yield statistical significance
○ Depending on the question
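A minimal sketch of hash-based sharding that doubles as a sampler, assuming events are sharded on a high-cardinality key such as a user ID. The function names, field names, and shard count are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a shard key to a shard number, stable across processes and restarts."""
    # hashlib is used instead of hash() because Python salts str hashes per process.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def sample_by_shard(events, key_field, num_shards, keep_shards):
    """Keep every event whose key lands in one of the chosen shards."""
    return [e for e in events if shard_for(e[key_field], num_shards) in keep_shards]


if __name__ == "__main__":
    events = [{"uid": f"u{i}", "n": n} for i in range(100) for n in range(3)]
    sample = sample_by_shard(events, "uid", num_shards=10, keep_shards={0})
    print(len(sample), "of", len(events), "events kept")
```

Since all events for a given key land in the same shard, reading one shard gives roughly a 1-in-`num_shards` sample in which every included key has its complete history.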
![Page 10: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/10.jpg)
Deployment
● Continuous Deploy?
● Deploying our javascript costs $3k
○ Have to invalidate 1.4B browser caches
○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDOSed ourselves
○ Very interesting bugs
● Simulation is weak
○ The internet is a dirty place
○ Embrace incremental deploys
![Page 11: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/11.jpg)
The Log
Jay Kreps: “Real-time data’s unifying abstraction”
● Centralized logging
● Loosely coupled consumers
Divide your dependencies:
● Synchronous - 0mq
● Asynchronous - Kafka
Distributed event logging
● Does determinism matter?
Log format durability?
● Protobuf?
http://bit.ly/thelog
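The idea of loosely coupled consumers can be sketched with an in-memory stand-in for a log: producers only append, and each consumer tracks its own read offset. This is an illustration of the abstraction, not the Kafka or 0mq client API; the class names are hypothetical.

```python
class Log:
    """Append-only sequence of records addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=100):
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads the log at its own pace; its offset is independent of other consumers."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=100):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


if __name__ == "__main__":
    log = Log()
    for event in ["view", "share", "follow"]:
        log.append(event)
    fast, slow = Consumer(log), Consumer(log)
    print(fast.poll())               # all three events
    print(slow.poll(max_records=1))  # only the first; it catches up later
```

A slow or offline consumer does not block the producer or other consumers; it simply resumes from its stored offset.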
![Page 12: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/12.jpg)
Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● https://github.com/addthis/columncompressor
○ by @abramsm
[Diagram: Input Data rows (Time, IP, UID, URL, Geo) are regrouped into Stored Data columns (Time; IP; UID; URL; Geo), one Block Size of rows at a time]
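The technique can be sketched as follows: take a block of rows, transpose it into columns, and compress each column separately, optionally with a different codec per column. This is not the columncompressor API; the function names, the JSON column encoding, and the codec choices are assumptions made for illustration.

```python
import bz2
import json
import zlib

# Codec names are illustrative; any (compress, decompress) pair would work.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}


def compress_block(rows, fields, codec_for=None):
    """Transpose a block of row dicts into columns and compress each column."""
    codec_for = codec_for or {}
    stored = {}
    for field in fields:
        name = codec_for.get(field, "zlib")
        compress, _ = CODECS[name]
        column = json.dumps([row[field] for row in rows]).encode("utf-8")
        stored[field] = (name, compress(column))
    return stored


def decompress_block(stored, fields):
    """Rebuild the original rows from the compressed columns."""
    columns = {}
    for field in fields:
        name, payload = stored[field]
        _, decompress = CODECS[name]
        columns[field] = json.loads(decompress(payload).decode("utf-8"))
    count = len(columns[fields[0]]) if fields else 0
    return [{f: columns[f][i] for f in fields} for i in range(count)]


if __name__ == "__main__":
    fields = ["time", "ip", "uid", "url", "geo"]
    rows = [
        {"time": 1332294838 + i, "ip": "203.0.113.7", "uid": f"u{i % 5}",
         "url": "http://example.com/a", "geo": "US-VA"}
        for i in range(1000)
    ]
    block = compress_block(rows, fields, codec_for={"url": "bz2"})
    print({f: len(payload) for f, (_, payload) in block.items()})
    assert decompress_block(block, fields) == rows
```

Values within a column tend to resemble each other more than values within a row, which is why a general-purpose compressor usually does better on the transposed layout; the actual savings depend on the data and block size.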
![Page 13: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/13.jpg)
Tunable QoS
Cassandra URL Store
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
○ Depending on write rate per record
● Global TTL knob
○ Turn up to reduce load for maintenance
○ Turn down to improve responsiveness
[Diagram: CDN cache]
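One way to picture the variable TTL and the global knob is a function that shortens the cache lifetime for frequently written records and then scales the result by a single multiplier. The formula, parameter names, and constants here are assumptions for illustration; the slide does not specify how the TTLs are actually computed.

```python
def cache_ttl_seconds(writes_per_day, global_knob=1.0,
                      base_ttl=3600.0, min_ttl=60.0, max_ttl=86400.0):
    """Pick a cache TTL: hot records expire sooner, cold records are cached longer."""
    per_record = base_ttl / (1.0 + writes_per_day)
    per_record = max(min_ttl, min(max_ttl, per_record))
    # global_knob > 1 lengthens every TTL (less backend load, staler data);
    # global_knob < 1 shortens every TTL (fresher data, more backend load).
    return per_record * global_knob


if __name__ == "__main__":
    for writes in (0, 9, 1000):
        print(writes, cache_ttl_seconds(writes), cache_ttl_seconds(writes, global_knob=2.0))
```

Turning the knob up during maintenance pushes more reads onto the cache; turning it down makes updates visible sooner at the cost of more reads against the store.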
![Page 14: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/14.jpg)
Hydra
Our custom processing system
Optimized for real-time data
Just open sourced: https://github.com/addthis/hydra
Go see @csby’s talk
Great Hall North @3:55pm
![Page 15: Data Lessons Learned at Scale](https://reader034.vdocuments.net/reader034/viewer/2022051609/547e948f5906b5a6718b4696/html5/thumbnails/15.jpg)
Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency toward perfection
○ I’m still struggling with this