§ Release it! - Takeaways

Upload: manuela-grindei

Post on 14-Jan-2017


TRANSCRIPT

Page 1: Release it! - Takeaways

§Release it! - Takeaways

Page 2: Release it! - Takeaways

Agenda

1. Stability
2. Capacity
3. 0 downtime deployments

Page 3: Release it! - Takeaways

§Stability

Page 4: Release it! - Takeaways

Major airline incident – introduction

• Started with a planned failover on the database cluster that served Core Facilities (CF)
• CF handled flight searches: critical, so designed for high availability
• CF was going to be used by self-service check-in kiosks, IVR, and “channel partner” applications

Page 5: Release it! - Takeaways

Major airline incident – outage facts

• Thursday evening, 11 pm: a team of engineers executed a manual database failover from CF db1 to CF db2, updated db1, migrated the database back to db1, and applied the same change to db2
• 12:30 am: the crew marked the change as “Completed, Success” and signed off (no downtime)
• 2:30 am: all the check-in kiosks in the USA went red (stopped servicing requests)
• Minutes later: the IVR servers went red too
• A Severity 1 case was opened immediately
• Priority: restore service by restarting the CF and kiosk application servers
• Total elapsed time: approx. 3 hours

Page 6: Release it! - Takeaways

Major airline incident – consequences

• Cost the company hundreds of thousands of dollars
• When the kiosks go down, off-shift agents are called in
• It took until 3 pm to deal with the backlog
• Delayed flights, reallocated gates
• Bad publicity for the airline in the media
• Affected the FAA’s annual report card, which measures customer complaints and on-time arrivals/departures (less money for the CEO)

Page 7: Release it! - Takeaways

Major airline incident – post-mortem

• Data to collect:
  • application servers: log files, thread dumps, and configuration files
  • database servers: configuration files for the db and the cluster server
  • compare current db configuration files to those from the nightly backup
• Thread dumps:
  • all threads blocked inside SocketInputStream.socketRead(), trying vainly to read a response that would never come
  • all threads had called FlightSearch.lookupByCity()

Page 8: Release it! - Takeaways

Major airline incident – the culprit

public class FlightSearch implements SessionBean {

    private MonitoredDataSource connectionPool;

    public List lookupByCity(. . .) throws SQLException, RemoteException {
        Connection conn = null;
        Statement stmt = null;
        try {
            conn = connectionPool.getConnection();
            stmt = conn.createStatement();
            //…
        } finally {
            if (stmt != null) {
                // If stmt.close() throws, conn.close() below is skipped
                // and the connection is never returned to the pool.
                stmt.close();
            }
            if (conn != null) {
                conn.close();
            }
        }
    }
}
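The cleanup order on this slide can be exercised without a database. The following is a minimal sketch, not the airline's actual fix: `FakeConnection` and `FakeStatement` are hypothetical stand-ins for the JDBC types, and the statement's `close()` is made to fail on purpose. It compares the slide's `finally` block with try-with-resources, which closes every declared resource even when an earlier `close()` throws.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for java.sql.Connection and java.sql.Statement,
// so the sketch runs without a database or a JDBC driver.
class SafeCleanupSketch {

    /** Counts connections handed out and not yet returned to the "pool". */
    static final AtomicInteger checkedOut = new AtomicInteger();

    static class FakeStatement implements AutoCloseable {
        @Override
        public void close() throws Exception {
            // Simulates a statement whose close() fails, e.g. on a dead socket.
            throw new Exception("statement close failed");
        }
    }

    static class FakeConnection implements AutoCloseable {
        FakeConnection() { checkedOut.incrementAndGet(); }
        FakeStatement createStatement() { return new FakeStatement(); }
        @Override
        public void close() { checkedOut.decrementAndGet(); }
    }

    /** Same cleanup order as the slide: a failing stmt.close() skips conn.close(). */
    static void leakyLookup() throws Exception {
        FakeConnection conn = null;
        FakeStatement stmt = null;
        try {
            conn = new FakeConnection();
            stmt = conn.createStatement();
        } finally {
            if (stmt != null) { stmt.close(); }
            if (conn != null) { conn.close(); }
        }
    }

    /** try-with-resources closes both resources, even if the first close() throws. */
    static void safeLookup() throws Exception {
        try (FakeConnection conn = new FakeConnection();
             FakeStatement stmt = conn.createStatement()) {
            // query work would go here
        }
    }

    /** Runs one lookup, swallows its exception, and reports connections still checked out. */
    static int leakedAfter(boolean safe) {
        checkedOut.set(0);
        try {
            if (safe) { safeLookup(); } else { leakyLookup(); }
        } catch (Exception expected) {
            // the close() failure still reaches the caller in both variants
        }
        return checkedOut.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakedAfter(false)); // prints "leaky: 1"
        System.out.println("safe:  " + leakedAfter(true));  // prints "safe:  0"
    }
}
```

Each leaked connection shrinks the pool by one, so repeated failures would eventually leave every caller blocked waiting for a connection.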

Page 9: Release it! - Takeaways

What is stability

• Transaction = an abstract unit of work processed by the system
• System = the complete, interdependent set of hardware, applications, and services required to process transactions for users
• Stability = the system keeps processing transactions, even when transient impulses, persistent stresses, or component failures disrupt normal processing (users can still get work done)
• Crack in the system = a component that starts to fail before everything else does
• Cracks propagate!
• Tight coupling accelerates cracks

Page 10: Release it! - Takeaways

Major airline incident – avoid propagation

• The pool could have been configured to create more connections when exhausted, or to block callers for a limited time rather than forever
• The client could have set a timeout on the RMI sockets
• CF servers could have been partitioned into more than one service group
• Use a circuit breaker
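The last bullet can be illustrated with a minimal circuit breaker sketch. It is an assumption-laden illustration rather than the book's implementation: the class name, the failure threshold, the cool-down period, and the choice to pass the current time in as a parameter (to keep the example deterministic) are all hypothetical.

```java
import java.util.function.Supplier;

// Minimal circuit breaker sketch. All names and parameters are hypothetical.
class CircuitBreakerSketch {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to fail fast before a trial call
    private int failures = 0;
    private long openedAt = 0L;
    private State state = State.CLOSED;

    CircuitBreakerSketch(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized State state() { return state; }

    /**
     * Runs the action unless the breaker is open; returns the fallback on failure
     * or when failing fast. Holding the lock during the call serializes callers,
     * which is a simplification made for brevity.
     */
    synchronized <T> T call(Supplier<T> action, T fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openMillis) {
                return fallback;         // fail fast: do not touch the remote system
            }
            state = State.HALF_OPEN;     // cool-down elapsed: allow one trial call
        }
        try {
            T result = action.get();
            failures = 0;
            state = State.CLOSED;        // success closes the breaker again
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN;
                openedAt = nowMillis;
            }
            return fallback;
        }
    }

    public static void main(String[] args) {
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(2, 1000L);
        Supplier<String> failing = () -> { throw new IllegalStateException("remote down"); };
        breaker.call(failing, "fallback", 0L);
        breaker.call(failing, "fallback", 10L);
        System.out.println(breaker.state());                           // prints OPEN
        System.out.println(breaker.call(() -> "ok", "fallback", 20L)); // prints fallback
        System.out.println(breaker.call(() -> "ok", "fallback", 2000L)); // prints ok
    }
}
```

While the breaker is open, callers get an immediate answer instead of tying up a thread on a call that is unlikely to succeed, which is what limits the spread of a crack.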

Page 11: Release it! - Takeaways

§Capacity

Page 12: Release it! - Takeaways

What is capacity

• Performance measures how fast the system processes a single transaction
• Throughput describes the number of transactions the system can process in a given time span
• Capacity is the maximum throughput a system can sustain, for a given workload, while maintaining an acceptable response time for each individual transaction
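The capacity definition above can be read as a small calculation over load-test results: among the load levels where response time stayed acceptable, take the highest throughput. The sketch below assumes this reading; the `LoadStep` type, the sample numbers, and the 500 ms threshold are invented for illustration and do not come from the slides.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the capacity definition applied to hypothetical load-test data.
class CapacitySketch {

    /** One load-test step: sustained throughput and the response time seen at that load. */
    static final class LoadStep {
        final double transactionsPerSecond;
        final double responseTimeMillis;

        LoadStep(double transactionsPerSecond, double responseTimeMillis) {
            this.transactionsPerSecond = transactionsPerSecond;
            this.responseTimeMillis = responseTimeMillis;
        }
    }

    /** Highest throughput whose response time is still acceptable; 0 if none qualifies. */
    static double capacity(List<LoadStep> steps, double maxAcceptableMillis) {
        double best = 0.0;
        for (LoadStep step : steps) {
            if (step.responseTimeMillis <= maxAcceptableMillis) {
                best = Math.max(best, step.transactionsPerSecond);
            }
        }
        return best;
    }

    /** Invented measurements: response time degrades sharply past 300 tx/s. */
    static List<LoadStep> sampleSteps() {
        return Arrays.asList(
                new LoadStep(100.0, 120.0),
                new LoadStep(200.0, 180.0),
                new LoadStep(300.0, 450.0),
                new LoadStep(400.0, 2000.0));
    }

    public static void main(String[] args) {
        System.out.println(capacity(sampleSteps(), 500.0)); // prints 300.0
    }
}
```

The same data gives a different capacity for a different response-time threshold or workload, which is why the definition fixes both.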

Page 13: Release it! - Takeaways

Retailer incident

• 300 people worked for about 3 years to build a complete replacement for the online store, content management, customer service, and order-processing systems
• 9 am: the program manager hit the big red button and the system went live
• 9:05 am: 10,000 sessions active on the servers
• 9:10 am: 50,000 sessions active on the servers
• 9:30 am: 250,000 sessions active on the servers

CRASH!!!!

Page 14: Release it! - Takeaways

Retailer incident – reasons for failure

• The number of sessions killed the site
• Each session got serialized and transmitted to a session backup server after each page request (session replication enabled)
• Sessions were consuming RAM, CPU, and network bandwidth
• All load test scripts used cookies to track sessions
• In production:
  • Search engines drove customers to old-style URLs
  • Search engine spiders expected the site to support session tracking via URL rewriting
  • Scrapers and shopbots did not handle cookies properly

Page 15: Release it! - Takeaways

Retailer incident – fixes

• Use server scripting to protect the site
• Added a gateway page that served three critical capabilities:
  • if the requester did not handle cookies properly, the page redirected the browser to a separate page that explained how to enable cookies
  • a throttle determined what percentage of new sessions would be allowed through to the real home page
  • specific IP addresses (shopbots, request floods) could be blocked from hitting the site
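The three gateway capabilities can be sketched as a single decision function. This is a hypothetical illustration, not the retailer's code: the order of the checks, the `HOLDING_PAGE` outcome for throttled visitors, and the random "roll" used to apply the percentage are all assumptions.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical gateway-page decision logic: block list, cookie check, throttle.
class GatewaySketch {

    enum Decision { BLOCK, COOKIE_HELP_PAGE, HOLDING_PAGE, HOME_PAGE }

    private final Set<String> blockedIps;
    private final int admitPercent; // share of new sessions let through, 0..100

    GatewaySketch(Set<String> blockedIps, int admitPercent) {
        this.blockedIps = new HashSet<>(blockedIps);
        this.admitPercent = admitPercent;
    }

    /** Deterministic variant: roll is a number in [0, 100) supplied by the caller. */
    Decision decide(String ip, boolean handlesCookies, int roll) {
        if (blockedIps.contains(ip)) {
            return Decision.BLOCK;             // shopbots, request floods
        }
        if (!handlesCookies) {
            return Decision.COOKIE_HELP_PAGE;  // explain how to enable cookies
        }
        // Assumption: throttled visitors see a holding page instead of the home page.
        return roll < admitPercent ? Decision.HOME_PAGE : Decision.HOLDING_PAGE;
    }

    /** Production-style variant: draws the roll at random for each new session. */
    Decision decide(String ip, boolean handlesCookies) {
        return decide(ip, handlesCookies, ThreadLocalRandom.current().nextInt(100));
    }

    public static void main(String[] args) {
        GatewaySketch gateway = new GatewaySketch(Collections.singleton("203.0.113.7"), 25);
        System.out.println(gateway.decide("203.0.113.7", true, 0));   // prints BLOCK
        System.out.println(gateway.decide("198.51.100.1", false, 0)); // prints COOKIE_HELP_PAGE
        System.out.println(gateway.decide("198.51.100.1", true, 10)); // prints HOME_PAGE
        System.out.println(gateway.decide("198.51.100.1", true, 90)); // prints HOLDING_PAGE
    }
}
```

Because none of these checks needs a server-side session, clients that never return a cookie stop creating sessions on the application servers.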

Page 16: Release it! - Takeaways

§0 downtime deployments

Page 17: Release it! - Takeaways

0 downtime deployments - Expansion

• Deploy new static files (images, stylesheets, JS)
• Create new service pools, if needed
• Add new tables
• Add new columns
• Run data migration scripts
• Add bridging triggers
• Apply recursive ZDD to prepare secondary clusters

Page 18: Release it! - Takeaways

0 downtime deployments - Rollout

• For each server:
  • Unpack code on the server
  • Stop accepting new requests
  • Shut down the server
  • Point to the new code
  • Start up the server
  • Verify clean startup
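The per-server rollout steps can be sketched as a loop. The `Server` interface and the in-memory `RecordingServer` are hypothetical, and halting the rollout at the first unclean startup is an assumption; the slide only lists the steps.

```java
import java.util.ArrayList;
import java.util.List;

// Rolling upgrade sketch: one server at a time, in the order listed on the slide.
class RolloutSketch {

    /** Hypothetical operations a deployment tool would expose per server. */
    interface Server {
        void unpack(String version);
        void stopAcceptingRequests();
        void shutdown();
        void pointTo(String version);
        void start();
        boolean startedCleanly();
    }

    /** Upgrades servers in order; returns how many were upgraded successfully. */
    static int rollOut(List<? extends Server> servers, String version) {
        int upgraded = 0;
        for (Server server : servers) {
            server.unpack(version);
            server.stopAcceptingRequests();
            server.shutdown();
            server.pointTo(version);
            server.start();
            if (!server.startedCleanly()) {
                // Assumption: stop here so the remaining servers keep serving old code.
                return upgraded;
            }
            upgraded++;
        }
        return upgraded;
    }

    /** In-memory stand-in that records each step applied to it. */
    static class RecordingServer implements Server {
        final List<String> log = new ArrayList<>();
        private final boolean healthy;

        RecordingServer(boolean healthy) { this.healthy = healthy; }

        @Override public void unpack(String version) { log.add("unpack " + version); }
        @Override public void stopAcceptingRequests() { log.add("stop accepting requests"); }
        @Override public void shutdown() { log.add("shutdown"); }
        @Override public void pointTo(String version) { log.add("point to " + version); }
        @Override public void start() { log.add("start"); }
        @Override public boolean startedCleanly() { return healthy; }
    }

    public static void main(String[] args) {
        List<RecordingServer> servers = new ArrayList<>();
        servers.add(new RecordingServer(true));
        servers.add(new RecordingServer(true));
        System.out.println(rollOut(servers, "v2")); // prints 2
        System.out.println(servers.get(0).log);
    }
}
```

Upgrading one server at a time keeps the rest of the pool serving requests, which is why old and new code must both work against the expanded schema.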

Page 19: Release it! - Takeaways

0 downtime deployments - Cleanup

• Remove bridging triggers
• Remove obsolete referential integrity relations
• Remove obsolete columns
• Remove obsolete tables
• Add new referential integrity relations
• Add NOT NULL constraints
• Remove obsolete static files
• Remove the old code
• Remove old service pools

Page 20: Release it! - Takeaways

§Thanks!

Page 21: Release it! - Takeaways
Page 22: Release it! - Takeaways

Capacity antipatterns/patterns

Antipatterns:
• Resource pool contention
• Excessive JSP fragments
• AJAX overkill
• Overstaying sessions
• Wasted space in HTML
• Reload button
• Handcrafted SQL
• Database eutrophication
• Integration point latency
• Cookie monsters

Patterns:
• Pool connections
• Use caching carefully
• Precompute content
• Tune the garbage collector