§Release it! - Takeaways
Agenda
1. Stability
2. Capacity
3. Zero-downtime deployments
§Stability
Major airline incident – introduction
• Started with a planned failover on the database cluster that served Core Facilities (CF)
• CF handled flight searches – critical, so designed for high availability
• CF was also going to be used by self-service check-in kiosks, IVR, and “channel partner” applications
Major airline incident – outage facts
• Thursday evening, 11 pm: a team of engineers executed a manual database failover from CF db1 to CF db2, updated db1, migrated the database back to db1, then applied the same change to db2
• 12:30 am: the crew marked the change as “Completed, Success” and signed off (no downtime)
• 2:30 am: all the check-in kiosks in the USA went red (stopped servicing requests)
• Minutes later: the IVR servers went red too
• A Severity 1 case was opened immediately
• Priority – restore service: restart the CF and kiosk application servers
• Total elapsed time: approx. 3 hours
Major airline incident – consequences
• Cost the company hundreds of thousands of dollars
• When the kiosks go down, off-shift agents are called in
• It took until 3 pm to clear the backlog
• Delayed flights, reallocated gates
• Bad publicity for the airline in the media
• Affected the FAA’s annual report card, which measures customer complaints and on-time arrivals/departures (less money for the CEO)
Major airline incident – post-mortem
• Data to collect:
  – application servers: log files, thread dumps, and configuration files
  – database servers: configuration files for the database and the cluster server
  – compare current database configuration files to those from the nightly backup
• Thread dumps: all threads were blocked inside SocketInputStream.socketRead(), trying vainly to read a response that would never come
• All blocked threads had called FlightSearch.lookupByCity()
Major airline incident – the culprit

public class FlightSearch implements SessionBean {
    private MonitoredDataSource connectionPool;

    public List lookupByCity(. . .) throws SQLException, RemoteException {
        Connection conn = null;
        Statement stmt = null;
        try {
            conn = connectionPool.getConnection();
            stmt = conn.createStatement();
            // …
        } finally {
            if (stmt != null) {
                stmt.close(); // if close() throws SQLException here…
            }
            if (conn != null) {
                conn.close(); // …this line never runs: the connection leaks
            }
        }
    }
}
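The cleanup bug above is that an exception thrown by the statement's close() skips the connection's close(), so the pool slowly drains. A minimal sketch of the fix using try-with-resources follows; FakeConnection and FakeStatement are hypothetical stand-ins (not JDBC classes) so the example runs without a database, but the same pattern applies to the real Connection and Statement:

```java
// Sketch: try-with-resources closes resources in reverse order, and an
// exception from one close() does not skip the others.
// FakeConnection/FakeStatement are made-up stand-ins for the JDBC objects.
public class SafeClose {
    static class FakeConnection implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }
    static class FakeStatement implements AutoCloseable {
        @Override public void close() throws Exception {
            // During the failover, Statement.close() threw because the
            // database endpoint had moved.
            throw new Exception("connection reset by peer");
        }
    }

    // Returns true if the connection got closed even though the
    // statement's close() threw.
    static boolean demo() {
        FakeConnection conn = new FakeConnection();
        try (FakeConnection c = conn; FakeStatement stmt = new FakeStatement()) {
            // ... run the flight-search query here ...
        } catch (Exception e) {
            // stmt.close() threw, but conn.close() had already run
        }
        return conn.closed;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "true": no leaked connection
    }
}
```

With the original nested-if finally block, demo() would return false: the statement's exception would abort the finally block before the connection was returned to the pool.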
What is stability
• Transaction = an abstract unit of work processed by the system
• System = the complete, interdependent set of hardware, applications, and services required to process transactions for users
• Stability = the system keeps processing transactions even when transient impulses, persistent stresses, or component failures disrupt normal processing (users can still get work done)
• A component of the system that starts to fail before everything else does = a crack in the system
• Cracks propagate!
• Tight coupling accelerates cracks
Major airline incident – avoid propagation
• The pool could have been configured to create more connections when exhausted, or to block callers for a limited time rather than forever
• The client could have set a timeout on the RMI sockets
• CF servers could have been partitioned into more than one service group
• Use a circuit breaker
§Capacity
What is capacity
• Performance measures how fast the system processes a single transaction
• Throughput describes the number of transactions the system can process in a given time span
• Capacity is the maximum throughput a system can sustain, for a given workload, while maintaining an acceptable response time for each individual transaction
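A back-of-the-envelope illustration of how these definitions relate (the numbers are hypothetical): in a thread-per-request system, sustained throughput is bounded by concurrency divided by per-transaction response time, a rearrangement of Little's law.

```java
// Hypothetical numbers, purely illustrative: an upper bound on throughput
// for a thread-per-request system is concurrency / response time.
public class CapacityEstimate {
    static double maxThroughputTps(int workerThreads, double secondsPerTxn) {
        return workerThreads / secondsPerTxn;
    }

    public static void main(String[] args) {
        // 50 threads at 200 ms per transaction -> at most 250 transactions/sec
        System.out.println(maxThroughputTps(50, 0.2)); // prints 250.0
    }
}
```

If the workload pushes arrivals above that bound while response time is held acceptable, the system is past its capacity.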
Retailer incident
• 300 people worked for about 3 years to build a complete replacement for the online store, content management, customer service, and order-processing systems
• 9:00 am: the program manager hit the big red button and the system went live
• 9:05 am: 10,000 sessions active on the servers
• 9:10 am: 50,000 sessions active on the servers
• 9:30 am: 250,000 sessions active on the servers
CRASH!!!!
Retailer incident – reasons for failure
• The sheer number of sessions killed the site
• Each session was serialized and transmitted to a session backup server after each page request (session replication was enabled)
• Sessions were consuming RAM, CPU, and network bandwidth
• All load test scripts used cookies to track sessions
• In production:
  – search engines drove customers to old-style URLs
  – search engine spiders expected the site to support session tracking via URL rewriting
  – scrapers and shopbots did not handle cookies properly
Retailer incident – fixes
• Use server scripting to protect the site
• Added a gateway page that served three critical capabilities:
  – if the requester did not handle cookies properly, the page redirected the browser to a separate page explaining how to enable cookies
  – a throttle determined what percentage of new sessions would be allowed through to the real home page
  – specific IP addresses could be blocked from hitting the site (shopbots, request floods)
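The throttle from the gateway page can be sketched as follows (class and method names are hypothetical, not the retailer's code): each new session is admitted to the real home page with a configurable probability, and everyone else can be sent to a lightweight static page.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of the gateway throttle described above; the names
// are made up. Each new session is admitted with probability
// admitPercent/100, so operators can dial traffic to the real home page
// up or down under load.
public class SessionThrottle {
    private volatile int admitPercent;

    public SessionThrottle(int admitPercent) {
        this.admitPercent = admitPercent;
    }

    // Operators can retune the throttle at runtime without a restart.
    public void setAdmitPercent(int p) {
        this.admitPercent = p;
    }

    public boolean admitNewSession() {
        // nextInt(100) yields 0..99, so admitPercent=100 admits everyone
        // and admitPercent=0 admits no one.
        return ThreadLocalRandom.current().nextInt(100) < admitPercent;
    }
}
```

A rejected session costs one cheap static page instead of a replicated server-side session, which is exactly the resource the retailer ran out of.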
§Zero-downtime deployments
Zero-downtime deployments – expansion
• Deploy new static files (images, stylesheets, JS)
• Create new service pools, if needed
• Add new tables
• Add new columns
• Run data migration scripts
• Add bridging triggers
• Apply recursive ZDD to prepare secondary clusters
Zero-downtime deployments – rollout
For each server:
• Unpack the new code on the server
• Stop accepting new requests
• Shut down the server
• Point to the new code
• Start up the server
• Verify clean startup
Zero-downtime deployments – cleanup
• Remove bridging triggers
• Remove obsolete referential integrity relations
• Remove obsolete columns
• Remove obsolete tables
• Add new referential integrity relations
• Add NOT NULL constraints
• Remove obsolete static files
• Remove the old code
• Remove old service pools
§Thanks!
Capacity antipatterns/patterns
Antipatterns:
• Resource pool contention
• Excessive JSP fragments
• AJAX overkill
• Overstaying sessions
• Wasted space in HTML
• Reload button
• Handcrafted SQL
• Database eutrophication
• Integration point latency
• Cookie monsters
Patterns:
• Pool connections
• Use caching carefully
• Precompute content
• Tune the garbage collector