Integrating Multiple CDN Providers at Etsy - Velocity Europe (London) 2013
DESCRIPTION
Relying on a single content delivery network for your site can impose a number of flexibility limitations. By diversifying your CDN providers you can put the power back in your hands, allowing you to get the best of all worlds in terms of performance, reliability and cost. In this talk Marcus and Laurie will present Etsy’s recent work integrating multiple CDN providers into their site delivery infrastructure. This presentation was delivered at Velocity Europe, November 2013.
TRANSCRIPT
Integrating Multiple CDN Providers
Our experiences at Etsy
@lozzd • @ickymettle
Marcus Barczak Laurie Denness
Staff Operations Engineers
Beginning of 2010 Today
Background
▪ First started using a single CDN in 2008
▪ Exponential Growth
▪ Start of 2012 began investigation into running multiple CDNs
Why use a CDN?
▪ Goal: Consistently fast user experience globally
▪ Improve last mile performance by caching content close to the user
▪ Offload content delivery from origin infrastructure to the CDN provider
Why use more than one CDN?
▪ Resilience
- Eliminate single point of failure
▪ Flexibility
- Balance traffic based on business requirements
▪ Cost
- Manage provider costs
The Plan
http://www.flickr.com/photos/malloy/195204215
The Plan
1. Establish evaluation criteria
2. Initial configuration and testing
3. Test with production traffic
4. Operationalising
Evaluation Criteria
http://www.flickr.com/photos/49212595@N00/5646403386
Evaluation Criteria
▪ Performance
▪ Configuration
▪ Reporting, Metrics and Logging
▪ Culture
Performance
▪ Baseline Response Times
- Should be within ±5% of our existing CDN provider’s response times
▪ Hit Ratios and Origin Offload
- Provider should achieve equivalent or better origin offload performance and hit ratios
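As a rough illustration of how these two criteria could be checked numerically, the sketch below compares a candidate response time against a baseline with a ±5% tolerance and computes a cache hit ratio. The function names and sample numbers are invented for illustration; they are not Etsy's tooling or measurements.

```python
def within_tolerance(candidate_ms, baseline_ms, tolerance=0.05):
    """True if the candidate response time is within +/- tolerance of the baseline."""
    return abs(candidate_ms - baseline_ms) <= tolerance * baseline_ms


def hit_ratio(hits, misses):
    """Fraction of requests served from cache (0.0 when there is no traffic)."""
    total = hits + misses
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical sample values, not real measurements.
    print(within_tolerance(103.0, 100.0))
    print(hit_ratio(950, 50))
```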
Configuration
▪ Complexity
- How complex is the provider’s configuration system?
▪ Self service
- Can you make changes directly, or do they require professional services or other intervention?
▪ Latency for changes
- How quickly do changes propagate?
Reporting, Metrics and Logging
▪ Resolution
▪ Latency
▪ Delivery
▪ Customisation
Culture
▪ Understand our culture
▪ Postmortems
▪ Access to technical staff
▪ Shared success
Initial Configuration and Testing
http://www.flickr.com/photos/7269902@N07/4592239326
Clean the house
http://www.flickr.com/photos/mastergeorge/8562623590
Clean the house
▪ Managing caching TTLs from origin
- CDNs honour the origin cache-control headers!
<LocationMatch "\.(gif|jpg|jpeg|png|css|js)$">
    Header set Cache-Control "max-age=94670800"
</LocationMatch>
Clean the house
▪ Manage gzip compression from origin
- Honoured by CDNs
- Compression from origin to CDN
## mod_deflate compression - see OPS-1537 ##
AddOutputFilterByType DEFLATE text/html text/plain text/css application/x-javascript [..]
Clean the house
If you can do it at origin, do it at origin
Mean Time To Curl
http://www.flickr.com/photos/wwarby/3297205226
curl -i -H 'Host: img0.etsystatic.com' \
    global-ssl.fastly.net/someimage.jpg
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
[...]
X-Served-By: cache-lo82-LHR
X-Cache: MISS
X-Cache-Hits: 0
curl -i -H 'Host: img0.etsystatic.com' \
    global-ssl.fastly.net/someimage.jpg
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
[...]
X-Served-By: cache-lo82-LHR
X-Cache: HIT
X-Cache-Hits: 1
https://www.etsy.com/listing/99871278
Mean Time To Curl = Done
Mean Time To Curl
▪ No need to touch existing infrastructure
▪ Smoke test of functionality
▪ 10 minutes from configuration to curl
▪ New providers should be plug and play
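The curl smoke test can also be scripted. A minimal sketch, assuming the response has already been captured as text (for example from `curl -i`): it pulls out the `X-Cache` and `X-Cache-Hits` headers shown in the responses above. The helper names are made up for illustration, and other providers may report cache status under different header names.

```python
def parse_headers(raw_response):
    """Parse the header block of a `curl -i` style response into a dict."""
    headers = {}
    lines = raw_response.strip().splitlines()
    for line in lines[1:]:          # skip the status line
        if not line.strip():
            break                   # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:                     # ignore lines that are not "Name: value"
            headers[name.strip().lower()] = value.strip()
    return headers


def cache_status(raw_response):
    """Return (X-Cache value, X-Cache-Hits as int) from a raw response."""
    headers = parse_headers(raw_response)
    return headers.get("x-cache"), int(headers.get("x-cache-hits", "0"))


# Sample response text based on the slide above.
SAMPLE = """HTTP/1.1 200 OK
Server: Apache
Cache-Control: max-age=94670800
X-Served-By: cache-lo82-LHR
X-Cache: HIT
X-Cache-Hits: 1
"""

if __name__ == "__main__":
    print(cache_status(SAMPLE))
```

Requesting the same object twice and checking that the second response reports a hit is the scripted equivalent of the two curl calls above.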
Testing In Production
http://www.flickr.com/photos/solarnu/10646426865
Testing with Production Traffic
▪ Images only at first
▪ Good test of caching performance
▪ Easy to test by swapping hostnames
▪ Made even easier with our A/B testing framework
A/B Test Framework
▪ Fine grained control
▪ Enable test for specific users or groups
▪ Percentage of users
▪ All controlled via configuration in code
▪ Rapid and complete rollback
Configure Mappings to CDNs
$server_config["image"] = array(
    'akamai' => array(
        'img0-ak.etsystatic.com',
        'img1-ak.etsystatic.com',
    ),
    'edgecast' => array(
        'img0-ec.etsystatic.com',
        'img1-ec.etsystatic.com',
    ),
    'fastly' => array(
        'img0-f.etsystatic.com',
        'img1-f.etsystatic.com',
    ),
);
Test Controls
$server_config['ab']['cdn'] = array(
    'enabled' => 'on',
    'weights' => array(
        'akamai' => 0.0,
        'edgecast' => 0.0,
        'fastly' => 0.0,
        'origin' => 100.0,
    ),
    'override' => 'cdn_diversity',
);
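To make the role of the weights concrete, here is a minimal sketch of how a percentage-weighted test could assign each user to one variant and keep that assignment stable between requests. It is not Etsy's A/B framework: the hashing scheme and the `pick_variant` helper are assumptions made for illustration, and only the weight values come from the config above.

```python
import hashlib

# Weights copied from the slide: everything goes to origin until a ramp-up starts.
WEIGHTS = {"akamai": 0.0, "edgecast": 0.0, "fastly": 0.0, "origin": 100.0}


def pick_variant(user_id, weights):
    """Map a user to a variant, deterministically, in proportion to the weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Hash the user id to a stable point in [0, total).
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    point = (int(digest, 16) % 10000) / 10000.0 * total
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    return name  # guard against floating point rounding at the top edge


if __name__ == "__main__":
    print(pick_variant("user-42", WEIGHTS))
```

Because the assignment depends only on the user id and the weights, setting a provider's weight back to 0.0 moves every user off it on their next request, which matches the rapid rollback described above.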
Metrics and Monitoring
http://www.flickr.com/photos/nicolasfleury/6073151084
Metrics and Monitoring
Even if it doesn’t move, graph it anyway
Metrics and Monitoring
Simplest approach: Provider’s dashboards
Metrics and Monitoring
▪ Get more detail by pulling metrics in house
▪ Write script to pull data from API
▪ Create dashboards with data
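The "pull data from API" step could end with something like the sketch below, which formats a dictionary of statistics as Graphite plaintext protocol lines (`<metric path> <value> <timestamp>`). Fetching the data is left out because every provider's API differs; the metric prefix and the sample numbers are placeholders, not real provider data.

```python
import time


def to_graphite_lines(prefix, stats, timestamp=None):
    """Format a dict of numeric stats as Graphite plaintext protocol lines."""
    ts = int(time.time() if timestamp is None else timestamp)
    return ["%s.%s %s %d" % (prefix, name, value, ts)
            for name, value in sorted(stats.items())]


if __name__ == "__main__":
    # Hypothetical numbers standing in for a provider API response.
    sample = {"requests": 1200, "hits": 1100, "bandwidth_bytes": 5000000}
    for line in to_graphite_lines("cdn.exampleprovider", sample, timestamp=1384000000):
        print(line)
```

Each line, terminated with a newline, can then be written to Carbon's plaintext listener (port 2003 by default).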
Testing Plan
1. for c in $cdns; do rampup $c; done;
2. Deliberately slow and steady
3. Watch traffic increase
4. Watch origin offload increase
5. Watch performance
Downsides of this approach
▪ A/B testing can’t be used for main site
▪ Exposing your test CNAMEs
- Especially if hotlinking is a concern
How do you know it’s broke?
▪ Check the graphs!
▪ Check with your community
▪ Keep support in the loop
Operationalising
http://www.flickr.com/photos/98047351@N05/9706165200
Content Partitioning
Etsy’s site partitioning
▪ Dynamic HTML Content: www.etsy.com
▪ Static Assets (js, css, fonts): site.etsystatic.com
▪ Listing Images, Avatars: imgX.etsystatic.com
Balancing Traffic in Production
http://www.flickr.com/photos/wok_design/2499217405
Balancing Traffic Using DNS
▪ Traffic Manager
▪ Extends DNS to dynamically return records based on rules
▪ Weighted round robin
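For readers unfamiliar with weighted round robin, the sketch below shows one common deterministic variant (the "smooth" scheme) that spreads picks across targets in proportion to their weights. This is only an illustration of the idea: the talk does not say how the DNS provider's traffic manager implements it, and the weights used here are invented.

```python
def weighted_round_robin(weights, count):
    """Return `count` picks spread in proportion to integer weights.

    Smooth weighted round robin: on every step each target gains its
    weight, the largest running total wins and is then reduced by the
    sum of all weights.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    current = {name: 0 for name in weights}
    picks = []
    for _ in range(count):
        for name, weight in weights.items():
            current[name] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


if __name__ == "__main__":
    # Hypothetical weights: half of the answers to one provider, a quarter to each other.
    print(weighted_round_robin({"akamai": 2, "edgecast": 1, "fastly": 1}, 8))
```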
Balancing Traffic Using DNS
[2589:~] $ dig +short www.etsy.com
www.etsy.com.edgekey.net.
e2463.b.akamaiedge.net.
23.74.122.37
[2589:~] $ dig +short www.etsy.com
cs34.adn.edgecastcdn.net.
93.184.219.54
[2589:~] $ dig +short www.etsy.com
global-ssl.fastly.net.
185.31.19.184
[2589:~] $ dig +short www.etsy.com
etsy.com.
38.123.123.123
Balancing Traffic Using DNS
▪ Rule updates typically made via web UI
▪ Can be slow and error prone
▪ Changes need to be applied to all three domains
▪ API available to make changes programmatically
cdncontrol
http://www.flickr.com/photos/foshydog/4441105829
DNS balancing downsides
▪ Low TTLs for fast convergence
▪ More DNS lookups for users
▪ Not 100% instant or deterministic
▪ Mo QPS == Mo Money
50% within 1 minute
Long Tail is Loooong
Monitoring in Production
http://www.flickr.com/photos/9229426@N05/5160787240
Whoopsie Page
▪ Static HTML delivered for 5xx errors
- Branding
- Translated error messages
- Links to status page
Failure Beacons
1. 1x1 tracking pixel embedded in page
[...]
<img src="//failure.etsy.com/status/images/beacon.gif?beacon_source=fastly_origin_failure-etsy.com">
</body>
</html>
2. Request creates an access log line
3. Scrape them out minutely using logster
self.reg = re.compile('^\S+(\s:)? (?P<remote_addr>[0-9\.]+),? [0-9\.,\- ]+ \[[^\]]+\] \"GET /status/images/beacon\.gif\?(beacon_)?source=(?P<source>\S+) HTTP/1\.\d\" \d+ [\d\-]+ \"(?P<referrer>[^\"]+)\" \"(?P<user_agent>[^\"]+)\" .*$')
4. Logster posts event counts to Graphite
5. Alert on Graphite graph in Nagios
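Steps 2 to 4 amount to counting beacon requests per source over a time window. A minimal standalone sketch of that counting step follows. It deliberately does not use the logster parser API, and the pattern is a simplified stand-in that only matches the request part of a log line, so treat it as an illustration rather than the production parser.

```python
import re
from collections import Counter

# Simplified, hypothetical pattern: it only looks at the request part of the line
# and accepts both "source=" and "beacon_source=" as the slide's regex does.
BEACON_RE = re.compile(
    r'"GET /status/images/beacon\.gif\?(?:beacon_)?source=(?P<source>[^&\s"]+)'
)


def count_beacons(log_lines):
    """Count beacon hits per source across an iterable of access log lines."""
    counts = Counter()
    for line in log_lines:
        match = BEACON_RE.search(line)
        if match:
            counts[match.group("source")] += 1
    return counts


if __name__ == "__main__":
    # Made-up log lines in the shape of the request shown in the talk.
    lines = [
        '[31/Oct/2013:07:06:42 +0000] "GET /status/images/beacon.gif'
        '?beacon_source=fastly_origin_failure-etsy.com HTTP/1.1"',
        '[31/Oct/2013:07:06:43 +0000] "GET /index.html HTTP/1.1"',
    ]
    for source, hits in count_beacons(lines).items():
        print(source, hits)
```

The per-source counts are what would be posted to Graphite each minute and alerted on.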
Failure Beacons
▪ Client IP address can be geolocated
Failure Beacons
▪ Optional extra debugging information
[31/Oct/2013:07:06:42 +0000] "GET /status/images/beacon.gif?beacon_source=fastly_origin_failure-etsy.com&provider_error=Connection%20timed%20out&server_identity=cache-ny57-NYC HTTP/1.1"
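The extra debugging fields travel as ordinary query parameters, so they can be recovered with the standard library. A small sketch, using the request from the log line above; the `beacon_fields` helper name is an assumption for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Request line taken from the access log example above.
REQUEST = ('GET /status/images/beacon.gif?beacon_source=fastly_origin_failure-etsy.com'
           '&provider_error=Connection%20timed%20out&server_identity=cache-ny57-NYC HTTP/1.1')


def beacon_fields(request_line):
    """Return the beacon query parameters from an HTTP request line as a dict."""
    _method, target, _version = request_line.split(" ", 2)
    query = urlsplit(target).query
    # parse_qs percent-decodes values and returns a list per key; keep the first.
    return {key: values[0] for key, values in parse_qs(query).items()}


if __name__ == "__main__":
    for key, value in beacon_fields(REQUEST).items():
        print(key, "=", value)
```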
Tracking Requests to Origin
GET / HTTP/1.1
User-Agent: curl/7.24.0
Accept: */*
X-Forwarded-Host: www.etsy.com
[...]
X-CDN-Provider: edgecast
[...]
Host: www.etsy.com
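On the origin side, the provider header can be read like any other request header. The sketch below assumes a Python WSGI application purely for illustration (the talk does not describe the origin stack in these terms), and the "direct" fallback label for requests without the header is an invented convention.

```python
def provider_from_environ(environ):
    """Read the X-CDN-Provider request header from a WSGI environ dict."""
    # WSGI exposes request headers as HTTP_<NAME>: upper-cased, dashes as underscores.
    return environ.get("HTTP_X_CDN_PROVIDER", "direct").lower()


if __name__ == "__main__":
    print(provider_from_environ({"HTTP_X_CDN_PROVIDER": "edgecast"}))
    print(provider_from_environ({}))
```

Tagging each origin request with this value is enough to break origin traffic and error rates down by provider.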
Backend Monitoring
▪ Logster on CDN provider header
▪ Vendor APIs to bring data in house
▪ Data in-house benefits include
- Integration with our anomaly detection systems
- Consistent and unified view of all CDN metrics
- We control data retention period
Awareness
▪ Over 100 engineers
▪ Deploying 60 times a day
▪ Correlating external and internal services
Awareness
Deploy lines
Frontend Monitoring
▪ Performance is important to us
▪ Monitoring overall site performance
▪ Monitoring performance by CDN provider
▪ Real User Monitoring (SOASTA mPulse) on key pages to track real user page performance
Downsides
http://www.flickr.com/photos/39272170@N00/3841286802
Debugging: What broke?
▪ MTTD/MTTR can be extremely low with this system
▪ But not always
Debugging: What broke?
▪ Non-technical member base
▪ Confusing and time consuming
▪ Amazing support team
▪ Log as much information as possible
Conclusions/Takeaways
http://www.flickr.com/photos/sk8geek/4649776194
Great success
▪ 12 months in, the benefits have far outweighed the few downsides
▪ We’re continuing to evolve the system
▪ We’ll be sure to share our experience with the community along the way
Links/Open Source
▪ cdncontrol
http://github.com/etsy/cdncontrol
http://github.com/etsy/cdncontrol_ui
▪ logster
http://github.com/etsy/logster
▪ CDN API to Graphite scripts
http://github.com/lozzd/cdn_scripts
Thanks!
Questions?