Scaling up to 30M users - The Wix Story

Scaling up to 30M users Aviran Mordo Server Group Manager Wix @aviranm Scaling Software, Scaling Data & Scaling People The Wix Experience Devcon TLV Feb 2013

Upload: aviran-mordo

Post on 22-Apr-2015

DESCRIPTION

How to grow a startup with no users to 30M users. The Wix story

TRANSCRIPT

Page 1: Scaling up to 30M users - The Wix Story

Scaling up to 30M users

Aviran Mordo
Server Group Manager, Wix
@aviranm

Scaling Software, Scaling Data & Scaling People
The Wix Experience

Devcon TLV Feb 2013

Page 2: Scaling up to 30M users - The Wix Story

About Wix

Page 3: Scaling up to 30M users - The Wix Story

Wix in Numbers

• Wix was founded in 2006
• 30M registered users from most countries
• Over 1,000,000 new users every month
• ~1,000,000 new websites every month
• Over 150 TByte of users’ media files
– More than 1 billion users’ media files
– More than 1.5 TByte of files uploaded daily
• Over 300 Servers in 2+1 datacenters + Google + Amazon

Page 4: Scaling up to 30M users - The Wix Story

Wix Initial Architecture

• Tomcat, Hibernate, Custom web framework
– Everything generated from HBM files
– Built for fast development
– Stateful login (Tomcat session), EHCache, File uploads
– Not considering performance, scalability, fast feature rollout, evaluate
– It reflected the fact that we didn’t really know what our business was
– We knew that we would need to replace it when we grew.
– However, we failed to understand how difficult that can be!

[Timeline: 2006 to 2013, with Flash and HTML 5 marked]

[Diagram: Wix (Tomcat) backed by a MySQL DB]

Page 5: Scaling up to 30M users - The Wix Story

Wix Initial Architecture

After two years, we found out that
• Our initial architecture allowed us to progress very fast
• However, as we progressed, we slowed down
• So, we learned that
– Don’t worry about ‘building it right from the start’: you won’t
– You are going to replace stuff you are building in the initial stages
– Be ready to do it
– Get it up to customers as fast as you can. Get feedback. Evolve.
– Our mistake was not planning for gradual re-write
– Build for gradual re-write as you learn the problems and find the right solutions

Page 6: Scaling up to 30M users - The Wix Story

Distributed Cache

Next we added EHCache as Hibernate 2nd-level cache
• Why?
– Because it was in the design
• How was it?
– Black-box cache
– How do we know the state of our system?
– How do we invalidate the cache?
– When do we invalidate it?
– How does “operations” manage the cache?
• Did we really need it? No!
• We eventually dropped it


Page 7: Scaling up to 30M users - The Wix Story


Editor & Public Segments

• The Challenge: updates to our Server imposed downtime on our customers’ websites
– Any Server or Database update has the potential of bringing down all Wix sites
– This is a symptom of a larger issue
• The Server served two different concerns
– Wix Users editing websites
– Viewing Wix Sites, the sites created by the Wix editor
• The two concerns require different SLAs
– Wix Sites should never ever have downtime!
– Wix Sites should work as fast as possible, always!
– However, an editing system does not require this level of SLA.

Page 8: Scaling up to 30M users - The Wix Story

Editor & Public Segments

• The two concerns evolve independently
– Releases of Editing features should have no impact on existing Wix sites operations!
• Our Solution
– Split the Server into two Segments: Public and Editor
• The Public segment targets serving websites for Wix Users
– Has a mostly read-only usage pattern: only updated when a site is published
– Simple publishing system
– Simple and read-only means it is easier to have higher SLA and DRP
– MySQL used as NoSQL: single large table with XML text fields
• The Editor segment
– Exposes the Wix Editing APIs, as well as user account and galleries management APIs.
– Has a different release schedule compared to the Public segment

[Diagram: Public (Tomcat) with its Public DB; Editor (Tomcat) with its Editor DB]

Page 9: Scaling up to 30M users - The Wix Story

Editor & Public Segments

What we have learned
• MySQL is a damn good NoSQL engine
– Our public DB was (mainly) one huge table
– Queries & Updates are by primary key
– Instead of relations, we use text/xml or text/json columns
– No updates for Blobs: immutable data
– No Transactions
• Use an indirection table pointing to the blob table
– Insert a new blob value, update the pointer to the new blob, async delete the old one
• MySQL auto-generated keys cause problems
– Locks on key generation
– Require a single instance to generate keys
• We use GUID keys
– Can be generated by any client
– No locks in key value generation
– Enabler for Master-Master replication
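The points above (primary-key access only, immutable blobs behind an indirection table, client-generated GUID keys) can be sketched roughly as follows. This is a minimal illustration, not Wix's schema: SQLite stands in for MySQL so the example is self-contained, and the table names `blobs` and `site_pointers` and the function names are hypothetical.

```python
import sqlite3
import uuid


def open_store():
    """Create the two tables; autocommit mode, since the slides say 'No Transactions'."""
    db = sqlite3.connect(":memory:", isolation_level=None)
    # Immutable documents (XML/JSON text), keyed by a client-generated GUID.
    db.execute("CREATE TABLE blobs (blob_id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    # Indirection table: one pointer per site to its current blob.
    db.execute("CREATE TABLE site_pointers (site_id TEXT PRIMARY KEY, blob_id TEXT NOT NULL)")
    return db


def publish(db, site_id, body):
    """Insert a new blob, repoint the site to it, and return the superseded blob id (or None)."""
    new_id = str(uuid.uuid4())  # generated by the client, so no key-generation lock in the DB
    row = db.execute(
        "SELECT blob_id FROM site_pointers WHERE site_id = ?", (site_id,)
    ).fetchone()
    db.execute("INSERT INTO blobs (blob_id, body) VALUES (?, ?)", (new_id, body))
    db.execute(
        "INSERT OR REPLACE INTO site_pointers (site_id, blob_id) VALUES (?, ?)",
        (site_id, new_id),
    )
    return row[0] if row else None


def read(db, site_id):
    """Two primary-key lookups, no joins: pointer first, then the blob."""
    row = db.execute(
        "SELECT blob_id FROM site_pointers WHERE site_id = ?", (site_id,)
    ).fetchone()
    if row is None:
        return None
    blob = db.execute("SELECT body FROM blobs WHERE blob_id = ?", (row[0],)).fetchone()
    return blob[0] if blob else None


def delete_blob(db, blob_id):
    """Cleanup of a superseded blob; a real system would run this asynchronously."""
    db.execute("DELETE FROM blobs WHERE blob_id = ?", (blob_id,))
```

Because a blob is never updated in place, a reader sees either the old or the new version in full, and the old blob can be removed later without blocking the publish path.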


Page 10: Scaling up to 30M users - The Wix Story

Wix on Managed Hosting


• Co-Location
– Own and maintain your own hardware
– Provisioning == buy and deliver your new server
– Reliable software on reliable hardware
• Managed Hosting
– Lease both hardware and maintenance
– Overnight provisioning
– Reliable software on reliable hardware
• Cloud
– Instantly lease hardware
– Instant provisioning
– Unlimited resources
– Reliable software on unreliable hardware

Page 11: Scaling up to 30M users - The Wix Story


Wix Media Segment

• The Challenge
– Our static storage reached over 500 GByte of small files
– The “upload to app server, post process files, copy to lighttpd server, serve by lighttpd” pattern proved inefficient, slow and error prone
– Disk IO became slow and inefficient as the number of files increased
– We needed a solution we can grow with, in terms of
  • HTTP connections
  • number of files
– We needed control over caching and HTTP headers
• We needed dynamic image manipulations
– Rebuilding a few million media files is not simple

Page 12: Scaling up to 30M users - The Wix Story

[Diagram: lighttpd servers 0.static to 7.static, reached over HTTP and grouped by file-name ranges (00-1f, 20-3f, 40-5f, 60-7f). A request such as “get 37D815B5.jpg” goes to the servers for the range containing 37, with a fallback if the file is not found.]

Prospero – Wix Media Storage

• Our Solution
– Lighttpd based
– Sharded on the file name
– Two copies of each file
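Sharding on the file name can be sketched as below. The slides only show that a file such as 37D815B5.jpg is routed by its leading hex digits to the servers for the matching range. The range width of 0x20, the host naming scheme, the `example.com` domain and the way the two copies map to hosts are all assumptions made for illustration.

```python
RANGE_WIDTH = 0x20  # assumed: each shard owns 32 of the 256 possible two-digit hex prefixes


def shard_range(file_name):
    """Return the (low, high) prefix range, as two-digit hex strings, that owns a file."""
    prefix = int(file_name[:2], 16)
    low = (prefix // RANGE_WIDTH) * RANGE_WIDTH
    return f"{low:02x}", f"{low + RANGE_WIDTH - 1:02x}"


def candidate_hosts(file_name, replicas=2):
    """Hypothetical host names holding the copies of a file, first choice first."""
    index = int(file_name[:2], 16) // RANGE_WIDTH
    return [f"{index * replicas + r}.static.example.com" for r in range(replicas)]


print(shard_range("37D815B5.jpg"))      # ('20', '3f')
print(candidate_hosts("37D815B5.jpg"))  # ['2.static.example.com', '3.static.example.com']
```

A client would try the first host and fall back to the second if the file is not found there, which is where keeping two copies of each file pays off.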

Page 13: Scaling up to 30M users - The Wix Story

• Dynamic Image processing
– Picture Pyramid
– Picture resize, crop and sharpen “on the fly”
– Thumbnail generation
• Eventual Consistency solutions scale
– But you have to build for when eventual consistency is not consistent
• Media files caching headers are critical
– Max-age, ETag, If-Modified-Since, etc.
– Think about how to tune those parameters for media files, as per your specific needs
• We tried Amazon S3 and Google for secondary storage
– However, Amazon proved unreliable (connections, availability)
• We found that using a CDN in front of Prospero is very effective
• Initially, files were stored on the filesystem
• We added a Tokyo Tyrant backend for small files
• We added a Memcached (Redis) layer for “in transit” files
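To make the caching-header point concrete, here is a small sketch of a media response with a long max-age and an ETag, plus the matching conditional request check. It is not Prospero's code: the one-year max-age, the SHA-1 content hash as ETag and the function names are assumptions, and it uses If-None-Match (the request header that pairs with ETag) rather than If-Modified-Since.

```python
import hashlib


def media_headers(body, max_age_seconds=31536000):
    """Build caching headers for an immutable media file (default max-age: one year)."""
    etag = '"' + hashlib.sha1(body).hexdigest() + '"'  # validator derived from the content
    return {
        "ETag": etag,
        "Cache-Control": f"public, max-age={max_age_seconds}",
    }


def respond(body, request_headers):
    """Return (status, headers, payload); 304 with an empty payload if the client copy is current."""
    headers = media_headers(body)
    if request_headers.get("If-None-Match") == headers["ETag"]:
        return 304, headers, b""
    return 200, headers, body
```

A first request gets status 200 with the full body; a repeat request that sends the received ETag back in If-None-Match gets 304 and no body, which saves bandwidth even after max-age expires.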

Prospero – Wix Media Storage


Page 14: Scaling up to 30M users - The Wix Story

• Our current architecture

Prospero – Wix Media Storage

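[Diagram: “get 37D815B5.jpg” is sent to the CDN. If the file is not in the CDN, the request goes to the Prospero servers (groups of x36, x36 and x32 servers, each marked T and M) in the Austin and Chicago data centers and to Google Cloud Storage, with a first fallback and a second fallback.]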
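The lookup chain in this architecture (CDN first, then storage tiers with a first and a second fallback) can be sketched as an ordered list of lookups. The tier names and the dict-backed stand-ins below are hypothetical, and the order of the storage tiers is an assumption, since the slide text does not state which location is tried first.

```python
def fetch_with_fallback(file_name, tiers):
    """Try each (tier_name, lookup) pair in order; return (tier_name, data) for the first hit."""
    for tier_name, lookup in tiers:
        data = lookup(file_name)
        if data is not None:
            return tier_name, data
    return None, None  # not found in any tier


# Dict-backed stand-ins for the real tiers (hypothetical names).
cdn = {}
primary_storage = {}
secondary_storage = {"37D815B5.jpg": b"jpeg-bytes"}

tiers = [
    ("cdn", cdn.get),
    ("primary", primary_storage.get),
    ("secondary", secondary_storage.get),
]

print(fetch_with_fallback("37D815B5.jpg", tiers))  # ('secondary', b'jpeg-bytes')
```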

Page 15: Scaling up to 30M users - The Wix Story

CDN

• Use a CDN!
• CDN acts as a great connection manager
– We have CDN hit ratios of over 99.9%
• Use the “Cache Killer” pattern
– http://static.wix.com/client/css/viewer.css?v=327
– http://static.wix.com/client/1.3.2/css/viewer.css
– Makes flushing files from the CDN redundant
– Enabler for longer caching periods
• There are many vendors
– We started with 1 CDN vendor
– We are now working with two CDN vendors
– Different CDN vendors have advantages in different geographies
• Tune HTTP Headers per CDN Vendor
– CDN Vendors interpret HTTP headers differently
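The two “Cache Killer” URL forms shown above can be generated with a pair of small helpers. The helper names are hypothetical; the base URL and the example values come from the slide.

```python
from urllib.parse import urlencode

BASE = "http://static.wix.com/client"


def versioned_query_url(path, build):
    """Put the build number in the query string, e.g. ...viewer.css?v=327."""
    return f"{BASE}/{path}?{urlencode({'v': build})}"


def versioned_path_url(path, version):
    """Put the version in the path, e.g. .../client/1.3.2/css/viewer.css."""
    return f"{BASE}/{version}/{path}"


print(versioned_query_url("css/viewer.css", 327))
# http://static.wix.com/client/css/viewer.css?v=327
print(versioned_path_url("css/viewer.css", "1.3.2"))
# http://static.wix.com/client/1.3.2/css/viewer.css
```

Every release produces a new URL, so the CDN treats the file as a new object. Old copies never need to be flushed, and each version can be cached for a long period.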

Page 16: Scaling up to 30M users - The Wix Story


Development Velocity

• The Challenge
– Our codebase became large and entangled
– Feature rollout became harder over time, requiring longer and longer manual regression
– The longer the regression was, the harder it became to make “a good release”
– Strange full-table-scan queries generated by Hibernate; we still have no idea what code is responsible for them…
• The solution
– Mid 2010: Wix Framework, modern base libraries
– Beginning 2011: CI / CD / TDD techniques + DevOps culture
– Mid 2011: Scala
– SOA Architecture (not WSDL)


Page 17: Scaling up to 30M users - The Wix Story

People are the key

• Train the people you already have
– We sent our entire QA department to learn Java
– Developers learn TDD and CI/CD methodologies.
• Hiring the right people is key to success
– Hire only the best developers (only seniors)
– Don’t count only on the interview, you need to test actual coding
– Anyone who interviews can drop a candidate
– Hire people who will challenge you (no “yes men”)
– Get people you can trust with “root” access to production
• Never stop hiring
– If we find an excellent person we will create a position for them even if we do not have one open.
• Wix is doubling its size every year
– Yes, we are currently hiring.
– We’re considering hiring and training junior developers.

Page 18: Scaling up to 30M users - The Wix Story

Wix’s CI / CD / TDD + DevOps model

• Abandon the “VERSION” paradigm: move to a feature-centric lifecycle
• Make small and frequent releases as soon as possible
– Today we release about 10 times a day, gaining velocity
• Empower the developer
– The developer is responsible from product idea to 100,000 active users
– Remove every obstacle in the developer’s path
– Big cultural change from waterfall: affects the whole company
– The developer is responsible for his app operations
• Automate everything: CI/CD/TDD
– CI: Continuous Integration
– CD: Continuous Delivery / Deployment
– TDD: Automated unit-tests, integration tests, GUI tests
• Measure Everything (The lean startup way)
– A/B test every new feature
– Monitor real KPIs (business, not CPU)
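One common way to A/B test a feature is to assign each user to a variant deterministically by hashing the user ID together with the experiment name. The slides do not say how Wix assigns users, so the sketch below is an assumption throughout, including the function name and the experiment name.

```python
import hashlib


def ab_variant(user_id, experiment, variants=("A", "B")):
    """Return the same variant for a given (experiment, user) pair on every call."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# The assignment is stable, so a user sees the same variant on every visit.
assert ab_variant("user-42", "new-editor") == ab_variant("user-42", "new-editor")
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user in group A of one test is not automatically in group A of the next.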

Page 19: Scaling up to 30M users - The Wix Story

CI / CD @ Wix – Release Process

• Make an RC
– Runs build, unit-tests, integration tests

Page 20: Scaling up to 30M users - The Wix Story

CI / CD @ Wix – Release Process

• Deploy as GA
– Using Chef, Noah, Artifactory
– Runs Self-Tests

Page 21: Scaling up to 30M users - The Wix Story

CI / CD @ Wix – Release Process

• Monitor
– Deployment, NewRelic, App-Info, Recent Events
• Rollback
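The release steps on these slides (make an RC, deploy as GA, run self-tests, roll back if needed) can be outlined as a small driver function. This is only a sketch of the control flow: the four callables are hypothetical stand-ins, and it does not model the actual tools named above (Chef, Noah, Artifactory, NewRelic).

```python
def release(build_rc, deploy, self_test, rollback):
    """Run the stages in order; roll back if the deployed version fails its self-test."""
    rc = build_rc()        # build + unit tests + integration tests; None means the RC failed
    if rc is None:
        return "rc-failed"
    deploy(rc)             # promote the release candidate to GA
    if self_test(rc):      # post-deployment self-tests
        return "released"
    rollback(rc)           # restore the previous version
    return "rolled-back"


# Example run with stand-in stages where the self-test fails.
events = []
outcome = release(
    build_rc=lambda: "1.0.1",
    deploy=lambda version: events.append(("deploy", version)),
    self_test=lambda version: False,
    rollback=lambda version: events.append(("rollback", version)),
)
print(outcome, events)  # rolled-back [('deploy', '1.0.1'), ('rollback', '1.0.1')]
```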

Page 22: Scaling up to 30M users - The Wix Story


Products we’ve built (partial list)
• Wix Mobile
• Wix HTML5
– Full HTML 5 support: total rewrite of our Flash product
• Third Party Applications (TPAs)
– With over 200,000 installations in the first 3 months
• Answers
– Wix’s unique support system
• Wix Billing System (PCI Compliant)
– Supports complex business models for TPAs
– Supports diverse geographies
• eCommerce
– Based on Magento
• BI

[Timeline labels: Mobile, HTML 5, TPA, Billing, Answers, App Builder, eCommerce]

Page 23: Scaling up to 30M users - The Wix Story

Wix Hackathon

• http://www.wix.com/publicevents/hackathon2013

Page 24: Scaling up to 30M users - The Wix Story