Architecting for Change: QCONNYC 2012

Optimized for change: Architecture @ Etsy Kellan Elliott-McCrea @kellan CTO, Etsy Monday, June 18, 12

Upload: kellan

Post on 08-Sep-2014


DESCRIPTION

A broad overview of Etsy's why and how.

TRANSCRIPT

Page 1: Architecting for Change: QCONNYC 2012

Optimized for change: Architecture @ Etsy

Kellan Elliott-McCrea, @kellan, CTO, Etsy

Monday, June 18, 12

Page 2: Architecting for Change: QCONNYC 2012


Page 3: Architecting for Change: QCONNYC 2012

* Launched June 18, 2005
* 875,000 active sellers
* 33.5MM items for sale
* $65.9MM in sales, in May
* 1.4B page views, in May
* 102 engineers
* 32 releases, last Friday

Page 5: Architecting for Change: QCONNYC 2012

Why?


Page 6: Architecting for Change: QCONNYC 2012

3 inevitabilities we design for:

1. Things break, unexpectedly
2. What we're building changes
3. We don't get to start over

Page 7: Architecting for Change: QCONNYC 2012

2 years of change.


Page 8: Architecting for Change: QCONNYC 2012

Architectural Principles

* Don't bet against the future.
* Our customers are humans.
* Simplicity always wins, in the end.
* Favor global over local optimization.
* Ambiguity kills momentum.
* Make failure cheap.
* Technical debt is an inevitable by-product of shipping code.
* Optimize for change.

Page 9: Architecting for Change: QCONNYC 2012

Ckrickett, http://www.etsy.com/listing/90611466

Cleverness

Page 10: Architecting for Change: QCONNYC 2012

Complex systems and change

1. Distributed systems are inherently complex.
2. The outcome of change in complex systems is hard to predict.
3. The outcomes of small, frequent, measurable changes are easier to predict and easier to recover from, and such changes promote learning.

Page 11: Architecting for Change: QCONNYC 2012

Continuous deployment, Metrics Driven Development, Blameless Post-Mortems

Page 12: Architecting for Change: QCONNYC 2012

Continuous deployment: Small, frequent changes to production


Page 13: Architecting for Change: QCONNYC 2012

Continuous Deployment:

No branching.

“All existing revision control systems were built by people who build installed software” - Paul Hammond, Always Ship Trunk, Velocity 2010

Page 14: Architecting for Change: QCONNYC 2012

Continuous Deployment: feature flags

if ($cfg['awesome_new_search']) {
    # new hotness
    $rsp = do_solr();
} else {
    # boring old stuff
    $rsp = do_grep();
}

Page 15: Architecting for Change: QCONNYC 2012

Continuous Deployment:

Ramp-ups (on top of feature flags)

1. Launch to staff only
2. Launch to 1% of all users
3. Launch to members of a beta group
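The three ramp-up stages can be sketched as one flag check. This is a minimal illustration in Python rather than the deck's PHP, and it is not Etsy's implementation: the config layout, the `is_staff` and `groups` user fields, and the CRC32 bucketing are all assumptions made for the example.

```python
import zlib

def bucket(user_id, flag, buckets=10000):
    """Map (flag, user) to a stable bucket so a user keeps the same variant."""
    return zlib.crc32(f"{flag}:{user_id}".encode()) % buckets

def is_enabled(flag, user, config):
    """Return True if `flag` is on for `user` under a hypothetical ramp-up config."""
    rule = config.get(flag)
    if rule is None or not rule.get("enabled", False):
        return False
    # Stage 1: staff only.
    if rule.get("staff", False) and user.get("is_staff", False):
        return True
    # Stage 3: members of a beta group.
    if rule.get("beta_group") and rule["beta_group"] in user.get("groups", ()):
        return True
    # Stage 2: a percentage of all users (1 means 1%).
    return bucket(user["id"], flag) < rule.get("percent", 0) * 100

config = {
    "awesome_new_search": {
        "enabled": True, "staff": True, "beta_group": "search_beta", "percent": 1,
    },
}
staff = {"id": 1, "is_staff": True}
print(is_enabled("awesome_new_search", staff, config))
```

Hashing the flag name together with the user id keeps each user's assignment stable across requests while giving different flags independent 1% samples.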

Page 16: Architecting for Change: QCONNYC 2012

Continuous Deployment:

Any engineer can launch a feature to 1% of users

Page 17: Architecting for Change: QCONNYC 2012

Continuous Deployment:

~200 experiments live right now


Page 18: Architecting for Change: QCONNYC 2012

Metrics driven development:

Introspection isn’t optional. Measure everything, log everything.

Page 19: Architecting for Change: QCONNYC 2012

Metrics driven development:

Metrics happen when you make it easy. And visible.


Page 20: Architecting for Change: QCONNYC 2012

Metrics driven development:

holtWintersConfidence(Upper|Lower)

Teach the computer to read graphs
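The slide names Holt-Winters confidence bounds. The Python sketch below is a much simpler, non-seasonal stand-in for the same idea: keep an exponentially weighted level and deviation, and flag points that fall outside a band around the level. The smoothing factor, band width, and warm-up length are illustrative assumptions, not values from the talk.

```python
def confidence_bands(series, alpha=0.3, width=3.0):
    """Return one (lower, upper) pair per point, built only from earlier points."""
    level = series[0]
    deviation = 0.0
    bands = []
    for value in series:
        bands.append((level - width * deviation, level + width * deviation))
        # Update the smoothed deviation and level after recording the band.
        deviation = alpha * abs(value - level) + (1 - alpha) * deviation
        level = alpha * value + (1 - alpha) * level
    return bands

def aberrations(series, warmup=5, **kwargs):
    """Indexes of points outside their band, ignoring the first `warmup` points."""
    bands = confidence_bands(series, **kwargs)
    return [
        i for i, (value, (lower, upper)) in enumerate(zip(series, bands))
        if i >= warmup and not lower <= value <= upper
    ]

logins_per_minute = [10, 11, 10, 9, 10, 11, 10, 9, 10, 50]
print(aberrations(logins_per_minute))
```

A check like this can run against every graph, so an alert fires when a metric leaves its expected band rather than when a person happens to look.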

Page 22: Architecting for Change: QCONNYC 2012

Optimize for MTTR, not MTBF


Page 23: Architecting for Change: QCONNYC 2012

How?


Page 24: Architecting for Change: QCONNYC 2012

Etsy


Page 25: Architecting for Change: QCONNYC 2012

Etsy

EMR/S3

PCI

BCP, Cold

Page 26: Architecting for Change: QCONNYC 2012

[Architecture diagram; labels as extracted]

inbound request: etsy.com / api.etsy.com / atlas, etsystatic.com / photos, bcn.etsy.com
CDNs - diversified at the DNS level
Internet providers - diversified at borders
Etsy network appliances
AWS
analytics, imstor
apache, php application, MySQL, search, memcache, async http, StatsD, sqlite, gearman, logs, server/OS, hardware
Squid, apache, php, imstor, NFS
apache, logs, logrotate, HDFS, analytics
EMR, JRuby/Cascading, S3, PHP, MySQL
S3
search: Thrift, Jetty, Solr slaves, datasets, Solr master, HBase, sharded MySQL
MySQL: dbindex, dbshards, dbaux, dbdata
mail out: SMTP, X-Yarnblaster
etc
PCI: via jsonp, no privileged access

Page 27: Architecting for Change: QCONNYC 2012

CDNs: Put a slider on it

Just works via weighted DNS
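The "slider" amounts to a set of weights. As a rough Python illustration of what weighted DNS does across many lookups, the sketch below picks a CDN in proportion to its weight; the CDN names and weights are hypothetical, and in practice the DNS provider performs this selection rather than application code.

```python
import random

def pick_cdn(weights, rng=random):
    """Pick one CDN name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]

# Moving the "slider": send roughly 90% of lookups to cdn_a, 10% to cdn_b.
slider = {"cdn_a": 90, "cdn_b": 10}
print(pick_cdn(slider))
```

Setting a weight to 0 drains a provider completely, which is what makes it cheap to move traffic away from a CDN that is misbehaving.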


Page 28: Architecting for Change: QCONNYC 2012

Apache

* Well known
* PHP is native
* apache_note
* fast start time
* cheap in place replacement
* .htaccess
* Challenge: memory usage

Page 29: Architecting for Change: QCONNYC 2012

Apache: apache_note

apache_note('etsy_uaid', $id);

Additive! insanely useful!

introspection through the life cycle


Page 30: Architecting for Change: QCONNYC 2012

Apache: log format

LogFormat "%{X-Forwarded-For}i %{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined

Page 31: Architecting for Change: QCONNYC 2012

Etsy: the App

* 487,000 lines of PHP
* 214,000 lines of JavaScript
* Monolithic codebase
* 3 front ends: Etsy.com, API, Atlas

Page 32: Architecting for Change: QCONNYC 2012

Etsy: the App

* routing handled by Apache
* scripts fronting OO PHP5
* PHP, fast by default
* opcode caching
* Challenge: liveliness when calling services

Page 33: Architecting for Change: QCONNYC 2012

Etsy: coding patterns

* lightweight, home-rolled “framework”
* ORM handles DAO across backends
* config and feature flag systems used everywhere
* small, slow-moving datasets stored as PHP arrays
* A/B tests
* Smarty
* StatsD
* Concurrency
* memcache

Page 34: Architecting for Change: QCONNYC 2012

Etsy: A/B tests

* beaconed
* inserted into logs via apache_note
* conditionalized on feature flags
* nightly reports on conversion, bounce rate, etc.
* nightly reports on page speed, memory usage, etc.

Page 35: Architecting for Change: QCONNYC 2012

Etsy: Smarty

* pre-compiled
* pre-compiled per language

Page 36: Architecting for Change: QCONNYC 2012

Etsy: StatsD

StatsD::increment("logins.success");
StatsD::timing("gearman.time", $msec);

* 340,000 application metrics
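For readers who have not used StatsD, the two calls above boil down to tiny UDP datagrams in the `name:value|type` wire format. The following Python sketch shows a fire-and-forget client; the class name, host, and port default are assumptions for illustration and not the PHP client shown on the slide.

```python
import socket

def format_metric(name, value, metric_type):
    """Build a StatsD datagram, e.g. 'logins.success:1|c'."""
    return f"{name}:{value}|{metric_type}"

class StatsDClient:
    """Minimal UDP client: counters ('c') and timers in milliseconds ('ms')."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload):
        try:
            self.sock.sendto(payload.encode("ascii"), self.addr)
        except OSError:
            pass  # losing a metric must never break the request

    def increment(self, name, count=1):
        self._send(format_metric(name, count, "c"))

    def timing(self, name, msec):
        self._send(format_metric(name, msec, "ms"))

print(format_metric("logins.success", 1, "c"))
print(format_metric("gearman.time", 320, "ms"))
```

Because UDP is connectionless, emitting a metric costs the application almost nothing, which is part of what makes measuring everything practical.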

Page 37: Architecting for Change: QCONNYC 2012

Etsy: Concurrency

* no native concurrency in PHP
* asynchronous HTTP calls
* Gearman

Page 38: Architecting for Change: QCONNYC 2012

Etsy: Async HTTP calls

* curl_multi_exec
* non-blocking, per-request timeouts
* used for optional aspects of a page
* curl against http://localhost to avoid network overhead
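The pattern in these bullets (fetch optional page fragments concurrently and drop whatever misses its deadline) can be sketched in Python with a thread pool. This is a stand-in for the idea only, not for `curl_multi_exec` itself; the fetcher names and the 100 ms budget are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_optional(fetchers, timeout_s=0.2):
    """Run every fetcher concurrently; slow or failing ones yield None."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(fetchers)))
    futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
    deadline = time.monotonic() + timeout_s
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = future.result(timeout=remaining)
        except Exception:
            results[name] = None  # timed out or raised: render the page without it
    pool.shutdown(wait=False)  # do not block the response on stragglers
    return results

def slow_fragment():
    time.sleep(0.5)
    return "arrives too late"

page_parts = fetch_optional(
    {"recommendations": lambda: "<ul>...</ul>", "activity_feed": slow_fragment},
    timeout_s=0.1,
)
print(page_parts)
```

The page renders with whatever came back in time, so an optional module that is slow degrades one fragment instead of the whole request.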

Page 39: Architecting for Change: QCONNYC 2012

Etsy: Gearman

* language-agnostic job server
* don’t use an MQ when you want a job server
* 150 job types
* persistent jobs flushed to MySQL, read from memory
* non-persistent jobs just stored in memory
* NP queue is wicked fast.

Page 40: Architecting for Change: QCONNYC 2012

Etsy: Gearman

* scaling CPU of cron jobs
* denormalizing data
* pushing to 3rd party services

Page 41: Architecting for Change: QCONNYC 2012

Etsy: Challenges

* Apache memory usage
* liveliness talking to services: no concurrency, blocking by default

Page 42: Architecting for Change: QCONNYC 2012

Etsy: graph of distributed failure


Page 43: Architecting for Change: QCONNYC 2012

Etsy: Challenges

* Apache memory usage
* liveliness talking to services: no concurrency, blocking by default

Enforce liveliness with a judicious application of force

Page 44: Architecting for Change: QCONNYC 2012

Etsy: judicious application of force

# /proc/self/statm starts with total, resident and shared sizes (in pages)
list($v, $res, $shar) = explode(' ', @file_get_contents('/proc/self/statm'));
$mine = $res - $shar;
if ($mine > $cfg['sizelimit']) {
    $pid = getmypid();
    @exec("kill -USR1 $pid");
}

Page 45: Architecting for Change: QCONNYC 2012

Etsy: judicious application of force

Bowhunter

* Find long running PHP processes
* Try to avoid those mid-post

open(APACHE, "/usr/bin/curl -s http://localhost/server-status|") || die "$!";

Page 46: Architecting for Change: QCONNYC 2012

Etsy: judicious application of force

Query_killer

* Same idea, long running queries
* MySQL “SHOW PROCESSLIST”
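A query killer of this kind reduces to a filter over the rows returned by `SHOW PROCESSLIST`. The Python sketch below assumes the rows have already been fetched as dictionaries keyed by the standard column names (`Id`, `Command`, `Time`, `Info`); the 10-second threshold and the choice of `KILL QUERY` are assumptions, not details from the talk.

```python
def queries_to_kill(processlist, max_seconds=10):
    """Return KILL statements for statements running longer than max_seconds."""
    statements = []
    for row in processlist:
        # Only running statements count; idle connections show Command == 'Sleep'.
        if row["Command"] == "Query" and row["Time"] > max_seconds:
            statements.append(f"KILL QUERY {int(row['Id'])}")
    return statements

rows = [
    {"Id": 101, "Command": "Sleep", "Time": 900, "Info": None},
    {"Id": 102, "Command": "Query", "Time": 83, "Info": "DELETE FROM ..."},
    {"Id": 103, "Command": "Query", "Time": 0, "Info": "SHOW PROCESSLIST"},
]
print(queries_to_kill(rows))
```

Separating "decide what to kill" from "issue the KILL" keeps the policy easy to test and to run in a dry-run mode first.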

Page 47: Architecting for Change: QCONNYC 2012

Memcache

* Caching, obviously
* Cache invalidation is hard
* Write buffering
* multi_get
* rate limits

Page 48: Architecting for Change: QCONNYC 2012

Memcache

* atomic INCR is awesome
* slice your time windows to reduce risk of cache eviction
* we’ve been unlucky, lots of segfaults :(
* multi_get slows down the more boxes in the pool
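One reading of "atomic INCR" plus "slice your time windows" is a rate limiter that keeps one counter per short time slice and sums the recent slices, so an evicted key loses only a fraction of the count. The Python sketch below follows that reading; the key format, slice size, and the dict-backed `FakeMemcache` stand-in are all assumptions for illustration.

```python
class FakeMemcache:
    """Dict-backed stand-in exposing only add, incr and get_multi."""

    def __init__(self):
        self.data = {}

    def add(self, key, value):
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def incr(self, key, delta=1):
        if key not in self.data:
            return None  # memcached INCR fails on a missing key
        self.data[key] += delta
        return self.data[key]

    def get_multi(self, keys):
        return {k: self.data[k] for k in keys if k in self.data}

def slice_key(user_id, slice_index):
    return f"rl:{user_id}:{slice_index}"

def record_hit(cache, user_id, now, slice_s=10):
    """Count one request in the current time slice."""
    key = slice_key(user_id, int(now // slice_s))
    if cache.incr(key) is None and not cache.add(key, 1):
        cache.incr(key)  # another writer created the key first

def over_limit(cache, user_id, now, limit=100, window_s=60, slice_s=10):
    """Sum the slices covering the window with one multi-get."""
    current = int(now // slice_s)
    keys = [slice_key(user_id, current - i) for i in range(window_s // slice_s)]
    return sum(cache.get_multi(keys).values()) > limit

cache = FakeMemcache()
for _ in range(5):
    record_hit(cache, "user42", now=1000.0)
print(over_limit(cache, "user42", now=1000.0, limit=4))
```

With a real memcached client the slice keys would also carry an expiry slightly longer than the window so old slices age out on their own.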

Page 49: Architecting for Change: QCONNYC 2012

MySQL: By the numbers

* 25K+ queries/sec avg
* 3TB InnoDB buffer pool
* 15TB+ data stored
* 50 servers
* 99.99% of queries under 1ms

Page 50: Architecting for Change: QCONNYC 2012

MySQL: a NotMuchSQL server

* no joins
* no foreign keys
* no transactions or locks
* no sub-selects
* store data like you want to read it.
* also: no auto_increment

Page 51: Architecting for Change: QCONNYC 2012

MySQL: a NotMuchSQL server

“Normalization is for sissies.” - Cal Henderson, Flickr

Page 52: Architecting for Change: QCONNYC 2012

MySQL: scale horizontally

* objects sharded by key
* lookups maintained in dbindex (MySQL is a FAST key-value store)
* avoid key hashing, range partitions, and partitioning functions

more: http://www.slideshare.net/jgoulah/the-etsy-shard-architecture-starts-with-s-and-ends-with-hard
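The difference between an index lookup and key hashing is easiest to see in code. In this Python sketch an in-memory dict plays the role of the dbindex lookup table; the shard names, the random initial placement, and the `migrate` helper are assumptions for illustration, not Etsy's ORM.

```python
import random

class ShardIndex:
    """Maps each object id to a shard through a stored lookup, not a hash function."""

    def __init__(self, shard_names, rng=random):
        self.shard_names = list(shard_names)
        self.lookup = {}  # stand-in for the dbindex lookup table
        self.rng = rng

    def shard_for(self, user_id):
        # First sighting: place the user somewhere and remember the choice.
        if user_id not in self.lookup:
            self.lookup[user_id] = self.rng.choice(self.shard_names)
        return self.lookup[user_id]

    def migrate(self, user_id, new_shard):
        # Moving a user is a one-row update; nobody else is rehashed.
        self.lookup[user_id] = new_shard

index = ShardIndex(["shard_001", "shard_002"])
home = index.shard_for(8528531)
index.migrate(8528531, "shard_003")
print(home, "->", index.shard_for(8528531))
```

Because placement is data rather than a function of the key, adding shards or rebalancing a hot user does not force a mass migration.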

Page 53: Architecting for Change: QCONNYC 2012

MySQL: Master-Master

* objects hashed to a side, avoid split brain
* allows in place schema upgrades without slave promotion
* simplified capacity planning

more: http://codeascraft.etsy.com/2012/04/20/two-sides-for-salvation/

Page 54: Architecting for Change: QCONNYC 2012

web0038 : [Mon Jun 18 09:58:38 2012] [error] [client 10.101.1.12] [C6kds9y1MVptEDMoOe5KCYha9VWl] [error] [ORM_LONG_QUERY] [/var/etsy/current/phplib/EtsyORM/Query/RawSql.php:752] [15877310] Query exceeded 10 seconds: long_query_time=83.0927 long_query_string='/* [etsy_shard_005_A] [/remove_favorite_listing.php] */ DELETE FROM `users_favoritelistings` WHERE `user_id` = ? AND `listing_id` = ?' long_query_trace='#10 __construct() /EtsyModel/UserFavoriteListingMirror.php:310 #4 delete() /EtsyModel/UserFavoriteListing.php:39 #3 delete() /EtsyModel/User.php:1840 #2 unfavoriteListing() /Controller/Favorites.php:344 #1 removeFavoriteListingRecord() /Controller/Favorites.php:94 #0 performRemoveFavoriteListing() /var/etsy/current/htdocs/remove_favorite_listing.php:9', referer: http://www.etsy.com/people/kellanem/favorites?page=5

MySQL: Introspection

SQL Comments are awesome!
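The log line above shows each query prefixed with a comment naming the shard and the calling script. A minimal Python sketch of that tagging step follows; the function name is hypothetical, and stripping `*/` from the tag values is an added precaution rather than something the slide describes.

```python
def tag_query(sql, shard, script):
    """Prefix a SQL statement with a '/* [shard] [script] */' comment."""
    def clean(value):
        # Keep tag text from closing the comment early.
        return str(value).replace("*/", "")
    return f"/* [{clean(shard)}] [{clean(script)}] */ {sql}"

print(tag_query(
    "DELETE FROM `users_favoritelistings` WHERE `user_id` = ? AND `listing_id` = ?",
    "etsy_shard_005_A",
    "/remove_favorite_listing.php",
))
```

The comment travels with the query into the slow log and the process list, so a long-running statement points straight back to the page that issued it.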


Page 55: Architecting for Change: QCONNYC 2012

MySQL: Deletes are expensive

* update objects to state=‘deleted’
* use partitions
* truncatenator - on ext3, hard link file, move, delete slowly.

Page 56: Architecting for Change: QCONNYC 2012

Anatomy of a feature: Shop Stats

[Screenshot of the Shop Stats page in the seller dashboard: graphs of views, favorites, orders, and revenue, plus tables of keywords, pages viewed, traffic sources, and traffic sources on Etsy.]

Page 57: Architecting for Change: QCONNYC 2012

Anatomy of a feature: Shop Stats

[Screenshot of the Shop Stats page, as on the previous slide.]

“Never get into a land war in Asia, and never build an analytics tool on top of MySQL.”

Page 58: Architecting for Change: QCONNYC 2012

Anatomy of a feature: Shop Stats

* buffer writes in Memcache using predictable keys
* flush to MySQL tables periodically via cron
* bake old data into all possible date ranges, and archive to S3
* truncate tables
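"Predictable keys" matters because memcache cannot list its contents: the flush job has to be able to compute every key it needs. One way to get that, sketched here in Python, is a per-minute counter that hands out slot numbers. The key layout and the dict-backed `FakeCache` are assumptions; the talk does not show the actual scheme.

```python
class FakeCache:
    """Dict-backed stand-in for the few memcache calls used below."""

    def __init__(self):
        self.data = {}

    def add(self, key, value):
        if key not in self.data:
            self.data[key] = value

    def incr(self, key):
        self.data[key] += 1
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def get_multi(self, keys):
        return {k: self.data[k] for k in keys if k in self.data}

def buffer_event(cache, minute, event):
    """Store the event under buf:<minute>:<slot>, where slot comes from a counter."""
    counter_key = f"buf:{minute}:count"
    cache.add(counter_key, 0)  # no-op if the counter already exists
    slot = cache.incr(counter_key)
    cache.set(f"buf:{minute}:{slot}", event)

def flush_minute(cache, minute, write_rows):
    """Cron side: rebuild the keys from the counter and write what survived."""
    count = cache.get(f"buf:{minute}:count") or 0
    keys = [f"buf:{minute}:{slot}" for slot in range(1, count + 1)]
    found = cache.get_multi(keys)
    rows = [found[key] for key in keys if key in found]
    write_rows(rows)  # e.g. one bulk INSERT into a MySQL table
    return len(rows)

cache = FakeCache()
for shop_id in (7, 7, 9):
    buffer_event(cache, minute=22331100, event={"shop_id": shop_id, "views": 1})
print(flush_minute(cache, 22331100, write_rows=print))
```

Events evicted before the flush are simply missing from the multi-get result, which is an acceptable loss for view counters but not for anything that must be exact.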

Page 59: Architecting for Change: QCONNYC 2012


Page 60: Architecting for Change: QCONNYC 2012

bcn.etsy.com: beaconed event stream

* Server-side and JavaScript event stream
* At least one per page view
* Apache serving static assets
* Aggregated on HDFS via logrotate
* Archived on S3
* Analyzed via JRuby/Cascading on Hadoop
* Doesn’t use: Flume, Scribe, etc.

Page 61: Architecting for Change: QCONNYC 2012

bcn.etsy.com: beaconed event stream

{"event_guid":"c2ffb51808b.6d2be52959ef{".user_id":8528531,"php_event_name":"s2","php_unique_id":"4fdf1cb5d5c078.37523961","php_event_date":"18\/Jun\/2012:08:19:01","locale_currency_code":"USD","pref_language":"en-US","region":"US","detected_region":"US","accept-languages":"en-US,en","isMobileDevice":"0","isMobileSupported":"0","isTabletSupported":"0","isTouch":"0","isEtsyApp":"0","listing_ids":[60274277,101504389,98682771,88585080],"cids":[14103953,14239293,14247717,14209614],"query":"blue","keywords":["blue","blue","blue","blue"],"position":1,"replay_number":1,"s2_cached":1,"php_ab_test_names":"orm_record_instance_caching;mobile_detector.all_blackberry;multilang_shops_listings.view;ga_replacement_cookie;disable_search_autosuggest;admin_toolbar;translations.live_translations;ab_analytics_test;search_type_experiment;search_ads.max_replays_less;search_diversity_experiment;search_cached_listing_cards;placefinder.cache_memcached_migration;search_stream_a;search_all_items_ignores_supplies;search_default_type;search.two_cluster_deploy;search_parameter_sample;thrift_category2_transform;search.similar_listing_browse_page;orm_replicant_safe_find_many;bottom_first;foreign_language_carousel;search.related_searches_all_items;weddings.srp_promos;search_log_page_position;newrelic;clientlog;google_analytics_async;personalized_endpoint;search_no_dropdown;community_nav_popout;security_settings;search_changes_tooltip;inline_listing_hearts;framelogger;log_normal;analytics_second_beacon;analytics_second_beacon_privileged;analytics_second_beacon_mobile","php_ab_var_names":"1;1;1;1;control;1;0;A;ponycorn_v3;1;threshold_off;1;1;1;0;all_sans_supplies;0;1;1;1;1;0;top;0;0;1;0;1;0;1;1;1;0;1;1;1;0;1;0;1","php_ab_selector_names":"

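In the event above, `php_ab_test_names` and `php_ab_var_names` each hold 40 semicolon-separated entries, which suggests they are parallel lists. Assuming that positional pairing (the slide does not state it), a short Python sketch can turn them into a test-to-variant map; the excerpt below uses four consecutive entries from the event.

```python
def ab_selections(event):
    """Pair each A/B test name with the variant at the same position."""
    names = event["php_ab_test_names"].split(";")
    variants = event["php_ab_var_names"].split(";")
    if len(names) != len(variants):
        raise ValueError("test names and variants are not parallel lists")
    return dict(zip(names, variants))

event_excerpt = {
    "php_ab_test_names": "admin_toolbar;translations.live_translations;ab_analytics_test;search_type_experiment",
    "php_ab_var_names": "1;0;A;ponycorn_v3",
}
print(ab_selections(event_excerpt))
```

Carrying every active selection on every beacon is what lets nightly reports slice any metric by any experiment after the fact.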

Page 62: Architecting for Change: QCONNYC 2012

Search

[Diagram: search architecture; labels as extracted]

* Search Master; Search Slave01, Search Slave02, ... Search SlaveNN
* BitTorrent to distribute indexes
* 100% of all indexes on each slave
* Web01, Web02, ... WebNN
* Thrift, with server affinity to improve cache hit ratio, just returns IDs
* databases and memcache
* hydrate IDs via multi-get, ignore a few failures
* denormalized listing store, transition from MySQL to HBase, not user facing
* pull via cron, push via gearman
* incremental index, every 7 minutes, avoid even numbered cron times

Page 63: Architecting for Change: QCONNYC 2012

Search

* Solr trunk
* Custom ranking via crunched datasets
* BitSet fields for personalized search
* Scaling the JVM
* 32% of visits, 40% of sales
* Also powers categories, unshardable queries
* Next time, just use HTTP
* Up next: custom codecs
* Avoiding sharding

Page 64: Architecting for Change: QCONNYC 2012

Search

* JVM slow start
* Search deployinator does rolling restart
* HotSpot and GC cause unpredictable throughput
* Overfetch - ask multiple servers, go with 1st response
* Index size is important. Don’t store too much.
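Overfetching trades extra load for predictable latency: send the same query to more than one server and use whichever answers first. A minimal Python sketch with a thread pool follows; the callables standing in for search servers, the fan-out of 2, and the timeout are assumptions for illustration.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def overfetch(query, servers, fanout=2, timeout_s=1.0):
    """Ask `fanout` servers at once and return the first successful response."""
    pool = ThreadPoolExecutor(max_workers=fanout)
    pending = {pool.submit(server, query) for server in servers[:fanout]}
    deadline = time.monotonic() + timeout_s
    try:
        while pending:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            done, pending = wait(pending, timeout=remaining, return_when=FIRST_COMPLETED)
            for future in done:
                if future.exception() is None:
                    return future.result()
        raise RuntimeError("no search server answered in time")
    finally:
        pool.shutdown(wait=False)  # the slower request is simply abandoned

def paused_by_gc(query):
    time.sleep(0.3)  # simulate a server stuck in a GC pause
    return ("slow-server", query)

def healthy(query):
    return ("fast-server", query)

print(overfetch("blue", [paused_by_gc, healthy]))
```

This only helps when slowness is uncorrelated across servers, as with GC pauses; it makes things worse if every server is slow because the cluster is overloaded.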

Page 65: Architecting for Change: QCONNYC 2012

Photos

* 400 million photos
* Uploaded locally, then streamed to S3
* GraphicsMagick FTW
* Working set is tiny, served out of Squid
* 2% read failure rate during full S3 outage.
* 0% write failure rate during full S3 outage.

JonathanOtis, http://www.etsy.com/listing/96361102/

Page 66: Architecting for Change: QCONNYC 2012

Technology no longer part of the stack

* Python Twisted
* PostgreSQL and stored procedures
* Scala and MongoDB
* Clojure and Tokyo Tyrant
* Rails
* ActiveMQ
* RabbitMQ
* a "Routes" framework
* building RPMs
* Lighttpd

Page 67: Architecting for Change: QCONNYC 2012

Takeaways

1. A few simple, boring, well known components
2. Extensive instrumentation
3. Rapid iteration and feedback loops
4. Human centric
5. A few tweaks on the classics for scale
6. Technology supports business goals