cloudaustin black friday 2013
DESCRIPTION
A 2014 CloudAustin presentation on how we prepared for and executed on our high traffic surge over Black Friday.
TRANSCRIPT
Black Friday 2013
Ernest Mueller, Bazaarvoice Engineering
What Is Black Friday?
• The National Retail Federation writes: For some retailers, the holiday season [Nov-Dec] can represent as much as 20-40% of annual sales.
• ShopperTrak says: National retail sales increased 2.7% and foot traffic decreased 14.6% when compared to the same two months last year (2012).
• Black Friday (the Friday after Thanksgiving) and Cyber Monday (the Monday after that) have become big discounting and promotional events that retailers use to push holiday purchasing.
• Summary: It’s a big deal to many of our clients and is becoming more ecomm-driven every year
Historically
In 2011 we served 1.52 B
And in 2012 we served 2.03 B.
Roadmap Prediction
Bazaarvoice expected 2.6 B review impressions on Black Friday & Cyber Monday 2013. That’s a 30% YoY growth rate.
Results
Bazaarvoice served 2.67 B review impressions on Black Friday & Cyber Monday 2013. That’s a 31.4% YoY growth rate.
Black Friday/Cyber Monday 2013 @BV
If you took all the reviews we served up to shoppers on Black Friday 2013 and printed them into paperback book form, it would take a bookshelf almost 11 miles long to hold them.
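The year-over-year arithmetic behind the prediction and results slides can be checked with a few lines of Python. This is only a sketch: it assumes the 1.52 B and 2.03 B figures measure the same thing as the 2013 numbers (review impressions over Black Friday and Cyber Monday), and the function names are made up for illustration.

```python
def yoy_growth(previous: float, current: float) -> float:
    """Fractional year-over-year growth from `previous` to `current`."""
    return (current - previous) / previous


def project(previous: float, growth: float) -> float:
    """Project next year's volume from last year's volume and a growth rate."""
    return previous * (1.0 + growth)


# Figures from the slides, in billions (rounded as shown on the slides).
served_2011 = 1.52
served_2012 = 2.03
served_2013 = 2.67

print(f"2011 -> 2012 growth: {yoy_growth(served_2011, served_2012):.1%}")
print(f"2013 projection at 30% growth: {project(served_2012, 0.30):.2f} B")
print(f"2012 -> 2013 growth: {yoy_growth(served_2012, served_2013):.1%}")
```

Because the slide figures are rounded, the last line comes out near 31.5% rather than the 31.4% quoted above; the quoted rate was presumably computed from unrounded counts.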
Step 0: Architecture
Scaling Isn’t Just For Black Friday
• We continuously work to scale the product – our data size doubles year over year
• Architectural changes to meet the demand are constant and ongoing – there is no “maintenance mode” at scale
• Your base architecture needs to be scalable
• Then you have to refactor again and again
The Three Amigos
Dove’s Thoughts
• Upping performance and running your system at 40% instead of 80% gave a lot of insight into our second order set of bottlenecks and performance characteristics
• Where to place/span ASGs and other Amazon bits was a major talking point among the Amigos; we ended up placing them per AZ because of our DNS/HAProxy front end
• The “diagonal scaling” challenge of instance size vs number of instances vs PIOPS speed is hard and you basically just have to run tests to dial in on the minima; this changes a lot over time
• Remember, with the public cloud a lot of this is black box and while that removes a lot of work from you, it adds other work and requires certain best practices to make the most of your system
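One way to "dial in on the minima" is to record each load-test run and pick the cheapest configuration that still meets a latency target. The sketch below assumes exactly that workflow; the `LoadTestResult` type, the latency target, and every number in `RESULTS` are invented for illustration and are not Bazaarvoice measurements.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LoadTestResult:
    """One load-test run for a candidate configuration (hypothetical shape)."""
    instance_type: str
    instance_count: int
    piops: int
    p95_latency_ms: float
    hourly_cost_usd: float


def cheapest_passing(results: Iterable[LoadTestResult],
                     max_p95_ms: float) -> Optional[LoadTestResult]:
    """Return the lowest-cost run that meets the latency target, or None."""
    passing = [r for r in results if r.p95_latency_ms <= max_p95_ms]
    return min(passing, key=lambda r: r.hourly_cost_usd, default=None)


# Invented numbers: instance size vs. instance count vs. PIOPS.
RESULTS = [
    LoadTestResult("m2.xlarge", 12, 1000, 180.0, 6.50),
    LoadTestResult("m2.xlarge", 10, 2000, 210.0, 6.10),
    LoadTestResult("m2.2xlarge", 6, 1000, 260.0, 6.30),
    LoadTestResult("m2.2xlarge", 5, 2000, 340.0, 5.60),
]

best = cheapest_passing(RESULTS, max_p95_ms=250.0)
print(best)
```

Since the minima change a lot over time, the table of results would need to be regenerated by re-running the tests rather than reused from a previous season.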
Step 1: Planning
This Year
• We started Black Friday specific work on August 12, 2013.
• That’s when client readiness surveys start coming in!
• We’ve done this in previous years, but this year there was a big additional demand placed on the planning…
The Old Meets The New
Communicate and Coordinate
• The first step is always internal communication
• We create an “Internal Preparedness Statement” to provide a concise, definitive statement for Engineering, Sales, Support, and Implementation
• Regular weekly prep status meetings
• From the August 12 “Planning is beginning” notification till the celebratory happy hour on Dec 16, I have 1,287 emails that mention “Black Friday.”
• Due to the new distributed-team challenge, we needed a person responsible for coordinating our overall Black Friday response…
Step 2: Freezing
BV Holiday Freeze Statement
Soft Freeze
We observe a general change freeze period starting 1 November and ending 15 January. During this period, we do not introduce changes to Bazaarvoice products that are integrated with our clients' websites. We may introduce changes into back-end systems that do not impact the end-user site experience.
Hard Freeze
We only release infrastructure and configuration changes required to restore service to or prevent a service disruption to one or more of our customers. The Critical System Change periods are:
• 5 days prior to and 5 days after Black Friday (24 November 2013 through 4 December 2013)
• 4 days prior to and 7 days after Christmas (21 December 2013 through 1 January 2014)
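The freeze windows can be derived from the holiday dates, which makes them easy to check from a deploy script. This is a minimal sketch, not a tool described in the talk: the function names are hypothetical, and it assumes the soft freeze runs from 1 November 2013 to 15 January 2014 with all ranges inclusive.

```python
from datetime import date, timedelta

BLACK_FRIDAY = date(2013, 11, 29)  # the Friday after Thanksgiving 2013
CHRISTMAS = date(2013, 12, 25)


def window(anchor: date, days_before: int, days_after: int) -> tuple:
    """Inclusive (start, end) range around an anchor date."""
    return (anchor - timedelta(days=days_before),
            anchor + timedelta(days=days_after))


# Hard freeze: 5 days either side of Black Friday, 4 before / 7 after Christmas.
HARD_FREEZES = [window(BLACK_FRIDAY, 5, 5), window(CHRISTMAS, 4, 7)]
# Soft freeze years are assumed; the statement only gives day and month.
SOFT_FREEZE = (date(2013, 11, 1), date(2014, 1, 15))


def freeze_level(day: date) -> str:
    """Return 'hard', 'soft', or 'none' for a proposed change date."""
    if any(start <= day <= end for start, end in HARD_FREEZES):
        return "hard"
    if SOFT_FREEZE[0] <= day <= SOFT_FREEZE[1]:
        return "soft"
    return "none"


for start, end in HARD_FREEZES:
    print(f"hard freeze: {start} through {end}")
print(freeze_level(date(2013, 11, 10)))
```

The computed hard-freeze ranges match the dates given in the statement (24 November through 4 December, and 21 December through 1 January).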
What Does Freeze Mean To You?
Step 3: Scaling
Traffic Projections and Scaling Plan
• Sadly, the answer isn’t as simple as “Amazon, yay!”
• Even they run out of resources over this period
• We conduct detailed YOY traffic projections
• We come up with a scaling plan to fit the projections
• Leave headroom!
Traffic Projection Tips
• Your system has various axes of scaling within it – trend and estimate them all
• We estimate incoming and outgoing reviews per day, peak requests per second on display servers, and calculate per-server acceptable capacity at each level (tomcat, Solr, database)
• Once you’ve done it one year, it’s easier because you can apply proportional lift to current traffic
• Keep an ear to the ground for environmental changes! This year retailers decided to start earlier and spike a little less on BF, so scaling came earlier than last year – but we read the news so we were prepared
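The projection tips above boil down to a small calculation: apply the expected lift to last year's peak, then divide by the capacity each server can safely carry once headroom is reserved. The sketch below illustrates that under loud assumptions: the per-server capacities and last year's peak are invented, every tier is assumed to see the full peak, and the 40% utilization target simply echoes the "40% instead of 80%" remark from earlier in the deck.

```python
import math


def servers_needed(last_year_peak_rps: float,
                   expected_lift: float,
                   per_server_rps: float,
                   target_utilization: float) -> int:
    """Servers required to carry the projected peak at the target utilization."""
    projected_peak = last_year_peak_rps * (1.0 + expected_lift)
    usable_rps = per_server_rps * target_utilization  # the rest is headroom
    return math.ceil(projected_peak / usable_rps)


# Hypothetical per-server acceptable capacity (requests/second) at each level.
TIERS = {"tomcat": 400.0, "solr": 270.0, "database": 1500.0}

LAST_YEAR_PEAK_RPS = 20_000.0  # hypothetical
EXPECTED_LIFT = 0.30           # proportional lift, reusing the 30% YoY prediction
TARGET_UTILIZATION = 0.40      # leave headroom

plan = {
    tier: servers_needed(LAST_YEAR_PEAK_RPS, EXPECTED_LIFT, capacity,
                         TARGET_UTILIZATION)
    for tier, capacity in TIERS.items()
}
print(plan)
```

In practice each axis of scaling (incoming reviews, outgoing reviews, display requests) would get its own peak and lift rather than sharing one number.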
[Chart: Pageviews and UGC Impressions, y-axis from 0 to 1,600,000,000; values labeled 1.337 B and 1.330 B]
Step 4: Supporting
Situational Awareness
• When the clock is running, you need your monitoring, alerting, response, etc. to be highly optimized for speed.
• We use a variety of monitoring types – nagios, zabbix, datadog, Keynote, pingdom
• And PagerDuty of course, aka “The One Ring”
• We write out runbooks for common response tasks such that we can have level 1 support people do them – or at least so that we don’t screw them up!
• Custom tooling is a must.
System Stats Histogram
[Diagram of system stats: CDN (hit rate 80%, TTL 600s), AWS East, AWS West, clusters c1 and c2. Labels: 164k RPS, 21k RPS, 12k RPS, 3.4k RPS (twice); 10, 12, and 10 m2.xlarge instances; 4330 ms, 8210 ms, 1023 ms, 2340 ms, 1240 ms]
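The labels on the system stats diagram are consistent with simple CDN offload arithmetic, sketched below. This reading is an assumption, since the diagram layout did not survive: it treats 164k RPS as traffic at the CDN edge and the 12k and 21k RPS figures as the load reaching the two AWS regions.

```python
def origin_rps(edge_rps: float, hit_rate: float) -> float:
    """Requests per second that miss the CDN cache and reach the origin."""
    return edge_rps * (1.0 - hit_rate)


EDGE_RPS = 164_000             # assumed to be edge traffic
CDN_HIT_RATE = 0.80            # "Hit Rate 80%" from the diagram
REGION_RPS = [12_000, 21_000]  # assumed to be the two regions' origin load

expected = origin_rps(EDGE_RPS, CDN_HIT_RATE)
observed = sum(REGION_RPS)
print(f"expected at origin: {expected:,.0f} RPS, labeled total: {observed:,} RPS")
```

Under that reading, an 80% hit rate leaves roughly a fifth of edge traffic for the origin servers, which is close to the sum of the two regional figures.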
Demo!
• https://monitoring.lab.bazaarvoice.com/dashboard/currentperformance
Escalated Response
• We had 3x daily (9 AM, 2 PM, 9 PM) status calls for all teams to check in
• We sent an overall status and system performance report to the entire company daily
• Oncall shifts of 12 hours apiece – not fully online but not “waiting for pages” either, need to be eyeballing the system at regular intervals
Step 5: Practicing
Test Your Plan!
• Test your scaling
– Amazon limits are your enemy – there’s a thousand of ’em and many are hidden
• Test your monitoring
• Test your paging
• Test your runbooks
• We had two “game days” to scale up, apply load, provoke issues and execute on remediation
Step 6: Profit
How It Went Down
• 23 teams across R&D and Support
• 40 engineers participating as Black Friday representatives
• 11 weeks of planning
• 2 stress-testing “Game Days”
• 26 round-the-clock status calls (8 “yellow” status, 18 “green”)
• 35 issues examined during the period
• $136,620.27 for the week in hosting costs
• Zero downtime
November Performance (c3)
Questions?
Recruiting Moment - BV:IO 2014
• Bazaarvoice’s internal tech conference and hackathon!
• Last year: Alamo Drafthouse, Adrian Cockcroft (Netflix), Jason Baldridge (UT), Nick Bailey (Datastax), Peter Wang (Continuum Analytics)
• This year: Norris Conference Center, Theo Schlossnagle (Circonus), Greg Brockman (Stripe CTF), Bob Metcalfe (UT)
• Late-nighter hackathon to develop sweet social commerce solutions
• Plus – COD: Black Ops!
Register: bvio2014.eventbrite.com
Team Signups On Hacker League
Koderz Only