Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas


DESCRIPTION

An updated, latest-and-greatest version in my Deming to DevOps series.

TRANSCRIPT

Page 1: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

1

Analyzing a Complex Cloud Outage

@botchagalupe, VP of Services, enStratus

John Willis ("Call me Botchagalupe"), VP of Services

Page 2: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

WHO AM I

2

30 years in IT. Ubuntu. Cloud evangelist. Startups: Opscode, DTO (awesome dudes), enStratus. GR called..

Page 3: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

GOALS

3

• Look at a complex cloud outage
• Understand complexity
• Analyze a complex cloud outage


Review bullets...

Page 4: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

4

Fed Reserve story (just the WSJ part). Amazon outages are a big deal. Minecraft.

Page 5: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

5


Page 6: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

6

# Let's take a look at the value stream of the service that failed on 10/22.
# In the middle is the green storage server.
# This box is a simplified stand-in for a larger service (meaning many servers; keeping it simple for now).
# We always try to read a value stream from right to left (from the customer back).
# In this example customers use something called EBS (block storage), essentially a cloud-based SAN.
# Next, the thing that is often left out of most value streams: the humans. In this example they were an integral part of the system.
# Next we have an operations monitoring database (disk): the things the humans need to know about.
# Could talk about autonomation (pre-automation): Sakichi Toyoda's loom stopped itself automatically if a thread broke or ran out.
# Next there is the EBS server failover machine. Most production systems in large IT centers will have failover.
# It is important to realize that there are core services on this box (e.g., EBS) and non-core services.
## Non-core services are things like the operations monitoring agent that feeds the monitoring database.
## Also, we will see in a minute there is a hardware agent on this EBS server for detecting hardware failures.
# Next is a fleet monitor server: basically a hardware monitor that can phone home or auto-order defective parts from the manufacturer (common in large infrastructures like Amazon, Google, Facebook).
# It has a failover server of its own.
# Last but not least, DNS is part of the value stream.
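As a reading aid, here is a minimal sketch of that value stream as a dependency map; the component names are placeholders for illustration, not Amazon's actual service names.

```python
# Minimal sketch of the value stream described above, as a dependency map.
# Component names are placeholders, not Amazon's actual service names.
VALUE_STREAM = {
    "customer":          ["ebs_service"],            # customers consume block storage
    "ebs_service":       ["ebs_server"],             # core service on the storage server
    "ebs_server":        ["monitoring_agent",        # non-core: feeds the ops monitoring DB
                          "hardware_agent",          # non-core: reports hardware failures
                          "ebs_failover_server"],
    "hardware_agent":    ["dns", "fleet_monitor"],   # finds the fleet monitor via DNS
    "fleet_monitor":     ["fleet_monitor_failover"],
    "monitoring_agent":  ["monitoring_db"],
    "operations_staff":  ["monitoring_db"],          # the humans watch the monitoring DB
}

def direct_dependents(component, graph=VALUE_STREAM):
    """Which components depend directly on `component` (i.e. what its failure can touch)."""
    return [name for name, deps in graph.items() if component in deps]

print(direct_dependents("fleet_monitor"))   # -> ['hardware_agent']
print(direct_dependents("dns"))             # -> ['hardware_agent']
```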

Page 7: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

6

The EBS System


Page 8: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

7

Server Failure

# On this one fine afternoon the fleet monitor server died. Remember, it receives data from the hardware agents on the EBS servers.

Page 9: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

8

Server Failover

# At this point there is most likely automated failover/HA.
# We see the failover server now supposed to be logically in the value stream (the new arrow).
# Any systems thinkers out there see the first problem with this red circle?

Page 10: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

9

DNS Propagation Failure

# The failover seems to work flawlessly, from the fleet monitoring team's perspective.
# However, our second problem happens: DNS does not update its records correctly.
# Therefore the hardware agent running on the EBS server is still pointing to the dead fleet server. (Everyone see that?)
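The second problem is exactly the kind of thing a failover runbook can measure instead of assume. A minimal sketch of such a check, with a hypothetical hostname and failover address:

```python
import socket

# Hypothetical names for illustration; the real fleet-monitor hostname and
# failover address are not known from the postmortem.
FLEET_MONITOR_HOST = "fleet-monitor.internal.example"
EXPECTED_FAILOVER_IP = "10.0.42.7"

def dns_points_at_failover(host=FLEET_MONITOR_HOST, expected=EXPECTED_FAILOVER_IP):
    """Return True only if DNS already resolves the service name to the failover box."""
    try:
        resolved = socket.gethostbyname(host)
    except socket.gaierror:
        return False          # the name does not resolve at all
    return resolved == expected

# A failover procedure (human or automated) could refuse to declare success
# until this returns True, instead of assuming DNS propagated.
```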

Page 11: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

10

Agent Memory Leak

# Now a third problem jumps in (the yellow box).
# When the hardware agent tries to write back to the wrong fleet monitor server (the dead one), some not-well-tested code fails and creates a memory leak.
# To make matters worse, this particular agent is designed to be fault tolerant. In other words, it should die silently and not disrupt any core service. That is, it is designed to be OK to fail if it can't send to the fleet monitor server; the assumption is that it will get it next time.
# Now you can start to see the first level of a complex system emerge.
## The first issue seems to be fixed (the fleet monitor failover).
## DNS isn't showing up as a failure on anybody's dashboard.
## And we have a silent error occurring on one of our core servers (a customer-facing service).
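The RCA does not show the agent's code, but the failure mode described above (silent retries against an unreachable endpoint that quietly accumulate state) can be sketched roughly like this; the queueing detail is an assumption for illustration:

```python
import socket

class HardwareAgent:
    """Toy model of a 'fail silently' reporting agent (illustrative only)."""

    def __init__(self, fleet_monitor_addr):
        self.fleet_monitor_addr = fleet_monitor_addr   # ("host", port) tuple
        self._pending = []                             # reports waiting to be sent

    def report(self, metrics):
        self._pending.append(metrics)   # BUG: grows without bound while the
        try:                            # fleet monitor stays unreachable
            self._send(self._pending)
            self._pending.clear()
        except OSError:
            pass                        # "fault tolerant": swallow the error silently

    def _send(self, payload):
        with socket.create_connection(self.fleet_monitor_addr, timeout=1):
            pass                        # a real agent would serialize and send payload

# A bounded queue (e.g. collections.deque(maxlen=1000)) or dropping reports on
# failure would have kept memory flat instead of leaking on a core server.
```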

Page 12: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

11

EBS Service is slowing down

# The memory leak continues undetected, and eventually it starts slowing down the EBS service because of low memory.
# Key point: it is probably still not detected by the IT staff. Maybe it is just starting to annoy customers, but not enough to turn the customer box yellow (yet).

Page 13: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

12

Throttling the API

# At some point the IT staff notices the slowdown. We would hope before the customers complain, and in Amazon's case that is probably true (they are pretty good).
# However, as we said earlier, they still don't know why it is slow.
# Another bit of complexity is introduced here.
## The EBS servers always run hot (high) on memory, so the undetected memory leak is most likely unnoticed at this point. (We will discuss this in detail when we get to the analysis part of this preso.)
## From Amazon's RCA it was pretty clear this was the case: they had not detected the memory leak.
## Next a human interaction takes place: they (the humans) decide to activate a throttling tool.
## They use it to throttle customer API requests as a stopgap, to buy time to figure out and hopefully fix the issue (the slowdown).
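Amazon has not published how their throttle works; a token bucket is one common way such a stopgap is built, sketched here with made-up rates:

```python
import time

class TokenBucket:
    """Simple token-bucket throttle: allow `rate` requests/second with bursts up to `capacity`."""

    def __init__(self, rate=100.0, capacity=100.0):   # numbers are illustrative only
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False        # caller gets a "slow down" error instead of service

# Tightening `rate` protects the backend, but every rejected call is felt
# directly by the customer, which is why throttling made the outage worse for them.
```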

Page 14: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

13

Customer Issues

# By now the customer is getting a double whammy.
## One, they were already experiencing slow responses from the service.
## Two, now the throttling has made it even worse for them.

Page 15: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

14

EBS Failover

# This situation continues: the IT staff still doesn't know why the EBS service is having issues.
# The customers' situation gets worse.
# And now the IT guys decide to punt again (like the throttling).
## They force a failover of the EBS service (servers).
## Keep in mind they still don't know what is wrong. Grasping at straws twice now; remember this for later.
## Notice that the new failover server is now in place (show the arrow).
## Complexity strikes again. This is a classic IT outage scenario: something seems to be fixed when it really isn't.
## The new failover server seems to have solved the problem; the new server is not slow at first.
## However, what they don't know is that all they have done is delay the inevitable.
# The memory leak just starts all over again on the failover server.
## The customer is still orange, mainly because of the throttling.

Page 16: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

15

Twitter Effect

# At this point we start getting what are called indirect effects.
## The first effect (and this was in the RCA) is that users tend to use more services when a potential outage is perceived. They start testing more services and trying other services.
## The next indirect effect is what I call the Twitter effect: the outage starts trending on Twitter, and everyone in the extended system starts kicking the tires on AWS. Let's start up Netflix; I wonder if GitHub is working OK...

Page 17: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

16

Failover Server Dies

# And of course the failover server eventually gets to the same state as the original EBS server.
# Meanwhile it is very likely that the IT staff still does not know why this is all happening.

Page 18: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s EBS Outage 10/22/2012

17

Systemic Outage

# Now our complete system is in a systemic failure.
# Ironically, the original failed-over fleet monitor server is just fine (no red there). No one is using it. Remember why?

10 Minutes

Page 19: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Understanding Complexity

18

# So let's talk about complexity from a theoretical standpoint.
# Typically humans think linearly. Our first instinct is that it is always X -> Y.
# One variable X will change the outcome Y (Y is the dependent variable).
## That goes for a new improvement (a change, a bug fix, maintenance, etc.),
## an emergency (like the Amazon issues),
## or a new product, feature, etc.


Page 23: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Understanding Complexity

19

# However, in real life it is never really X -> Y; you usually get many variables.
# In statistics this is the warning not to confuse correlation with causation.
# X -> Y is correlation, but it is dangerous to assume it is causation.
# Real life is not that simple; we call it the messiness of life.

# X1: a simple server failure
# X2: the failover
# X3: the DNS

Deming wrote of Chanticleer, the barnyard rooster who had a theory. He crowed every morning, putting forth all his energy, flapped his wings. The sun came up. The connexion was clear: His crowing caused the sun to come up. There was no question about his importance. There came a snag. He forgot one morning to crow. The sun came up anyhow.

Page 24: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Understanding Complexity

20

T1 T2

# You are also more likely to get time-dependent variables that add to the complexity.
# X1-X3 happen at T1, and X4-X5 happen at T2.

# X4 is the memory leak
# X5 is the dreaded throttling...

Page 25: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Understanding Complexity

21

T1 T2

# There are also indirect effects on the dependent variable (Y).
# For example, X1 in concert with X4 can conjointly affect the dependent variable Y.
## X1 changes X4, and the combined effect on Y is different.
## The same with X5.
# This is a different model than a simple X -> Y.
# X?: the customers respond with more usage
# X?: the Twitter effect

15 Minutes
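In regression terms, the conjoint effects just described are the difference between a single-cause model and one with many time-dependent variables plus interaction terms; a sketch of that distinction:

```latex
% Linear, single-cause intuition:
\[
Y = \beta_0 + \beta_1 X_1
\]
% What actually happened: several variables arriving at different times (T1, T2),
% plus interaction terms, e.g. the failover only hurts because the DNS miss and
% the leaking agent are also present:
\[
Y(t) = \beta_0 + \sum_{i} \beta_i \, X_i(t) + \sum_{i<j} \beta_{ij}\, X_i(t)\, X_j(t) + \varepsilon
\]
```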

Page 26: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

22

W. Edwards Deming (1900 – 1993)

• Father of Quality
• Understanding the system
• Understanding variation
• Understanding human behavior
• Introduced sampling into the US Census
• WWII success credited to his quality approach
• Taught Japan after WWII and transformed its quality
• In 1980 sparked the American quality revolution
• The foundation of Six Sigma

# There is a tool that has been used by successful companies like Toyota (lean) and many others.
# Dr. W. Edwards Deming gave us such a lens to break down complexity (the real world), just like a camera lens does.

20 Minutes

Page 27: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

System of Profound Knowledge (SoPK)

23

# Let's say a lens for improvement of something: an enhancement, a bug fix, a new product idea.
# An outcome, X -> Y.
# Dr. Deming gave us a tool called "The System of Profound Knowledge".

# Just common sense... However, Mark Twain said there is nothing common about common sense.

# SoPK is a lens to break down complexity and give ourselves an advantage so we do not oversimplify what we are trying to do.
# In other words, it clears up the messiness of real life just like a camera lens does.

Page 28: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

24

Knowledge of a System

• Systems Thinking
• End-to-End Value Stream
• What Is the Aim of a System?
• The Purpose of the System
• Global Optimization
• Not Local Optimization

(S) Appreciation of a System: systems thinking. Deming would say understanding the AIM of a system; he said every system must have an aim. Is your aim to keep a server up, or to protect a customer SLA? They might not be the same thing, as we will soon see. Eli Goldratt (Theory of Constraints) would say global optimization over local optimization, even if that means the subsystems are sub-optimized. Understanding subsystems and dependent systems.

Page 29: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Knowledge of a System

25

One big exercise in non-systems thinking. Clearly there were independent views of the system. What was the AIM of this system? Did the hardware guys have the same aim as the core-services guys?

# Lens #1: Not having a systems view; not seeing these as dependent systems. You might say surely they had automation for DNS. However, I would say no, because...

Lens #2: The hardware guys should know that they are part of a bigger system, not just the hardware monitor. They had code on a core service; was it smoke tested, immune tested? Was there a systems view for QA and smoke testing of agent code changes?

Lens #3: (X -> Y) Humans try to correct the memory issue with throttling, and they don't understand hardware monitoring as a subsystem. Local optimization...


Page 31: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

26

Knowledge of Variation

• There is always Variation
• Special Cause Variation
• Common Cause Variation
• Understanding Variation

Continuous improvement requires an understanding of variation.
You have a power outage and it takes key personnel a long time to get to the data center: that is special cause variation. A bad reaction would be to create a new policy that all personnel must live within 5 minutes of the data center (i.e., treating it like common cause).
Conversely, firing a new programmer who brings down a production system would be treating a common cause situation as special cause. More than likely it was bad safeguards, insufficient training...
(V) Variation: not understanding variation is the root of all evil. Deming would get mad at people for knee-jerk reactions caused by not understanding the kind of variation. How do you understand variation? Statistics (primarily the standard deviation and its relationship to a process, i.e., its distribution).
An example: a large cloud provider rates API calls at 100 (why 100?) per some interval. For most customers that is fine; others get treated as a DDoS. Where did they get 100? It had to be a guess. If they understood SPC (variation) they might derive the number from data and have a continuous-improvement process in place for when they found special variation.
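If that "100 per interval" limit were derived from data rather than guessed, the calculation could look roughly like this; the samples and the 3-sigma rule are illustrative assumptions:

```python
import statistics

def derive_rate_limit(requests_per_minute_samples, sigmas=3):
    """Set the limit from the observed distribution rather than a round-number guess."""
    mean = statistics.mean(requests_per_minute_samples)
    stdev = statistics.pstdev(requests_per_minute_samples)
    return mean + sigmas * stdev   # traffic beyond this is plausibly special cause

# e.g. a week of per-customer request rates (made-up numbers):
samples = [42, 55, 61, 48, 73, 50, 66, 58, 49, 62]
print(round(derive_rate_limit(samples)))   # a limit grounded in the process, not a guess
```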

Page 32: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

27

Control Chart

• Approximately 99.7% of the values will fall within 3 standard deviations of the mean

• Approximately 95% of the values will fall within 2 standard deviations of the mean

• Approximately 68% of the values will fall within 1 standard deviation of the mean
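A control chart is simple enough to sketch: establish the mean and 3-sigma limits from an in-control baseline, then flag points outside them as candidate special cause variation. The metric and numbers below are made up for illustration:

```python
import statistics

def control_limits(baseline, sigmas=3):
    """Control limits established from an in-control baseline period."""
    mean = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)
    return mean - sigmas * sigma, mean + sigmas * sigma

def special_cause(points, baseline, sigmas=3):
    """Points outside the baseline's control limits: candidates for real investigation."""
    lcl, ucl = control_limits(baseline, sigmas)
    return [x for x in points if x < lcl or x > ucl]

# Hypothetical per-host free-memory readings (%). Limits come from last week's
# baseline; today's slow leak pushes one reading below the lower control limit.
baseline = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 11.7, 12.3]
today = [12.0, 11.9, 11.6, 6.5]
print(special_cause(today, baseline))   # -> [6.5]
```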

Page 33: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Knowledge of Variation

28

The biggest issue here is the knee-jerk reactions: the throttling and the forced EBS server failover. They didn't understand the type of variation.

Lens #1: The systems guys don't understand common vs. special cause variation; they react to a "special" that should have been treated as "common". It turns out they were monitoring aggregate sub-processes rather than looking at individual monitors. For example, aggregate memory might have gone from 95% to 96%, which caused the issue; however, if they had been looking at the individual agent's memory they would have seen it.

Page 34: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

29

Theory of Knowledge

• Scientific Method
• Knowledge Must Have Theory
• Theory Must Have Prediction
• Prediction Must Have Tests
• Aim --> Measure --> Change

(K) is the simplest but the hardest for most people to understand. Simply put, it is applying the scientific method to everything you do. Deming says you must have theory to have knowledge, you can't have knowledge without prediction, and a prediction without a test is useless.

PDSA; others call it (AMC): Aim, Measure (a. what process are you going to change, b. measure whether the change worked), Change. You have to test any improvement to see if it worked, failed, or did nothing. Imagine someone standing up a failover system with automation but not testing to see if it really worked (could never happen).
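A minimal sketch of what "Aim, Measure, Change" looks like when it is enforced in code rather than assumed; the measure and change callables here are hypothetical:

```python
def apply_with_measure(change, measure):
    """Tiny Aim-Measure-Change loop: never declare a change done without a measured result."""
    before = measure()      # measure against the aim first
    change()                # e.g. perform the failover, or turn on throttling
    after = measure()       # then measure again
    if after < before:
        return "better"     # assumes lower is better, e.g. p99 latency
    if after > before:
        return "worse"
    return "no effect"      # a 'fix' that measures no effect is not a fix

# Usage sketch (illustrative only): measure() might return the EBS API p99 latency
# and change() might force the failover. The three return values map to the three
# outcomes named on the next slide: gets better, stays the same, gets worse.
```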

Page 35: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Theory of Knowledge

30

# Lens #1 (K): A theory cannot be an unmeasured guess. Whoever did the failover (automation or manual) apparently didn't have a proper measure of success. They should have verified that they were actually using the new server (duh).

Lens #2 (K): Measures without results are not fixes (throttling). They should have looked at the results. Three potential outcomes: a) it gets better, b) it stays the same, c) it gets worse. What do you think happened?

Page 36: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

31

Theory of Psychology

• Understanding Behavior
• Understanding Tribes
• Understanding Worldviews

(P) Another easy one, but the hardest to implement: understanding behavior, why people do the things they do. Tribal behavior: things that are important to one group might not be important to other groups. Understanding human behavior is another lens factor. Worldviews: imagine a server that has software on it from two totally different dev groups, and further imagine those two groups' worldviews are far apart. One does agile, CI, TDD, BDD, CD; the other has never even heard of those things. (Groupthink experiment...)

Page 37: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Theory of Psychology

32

Lens #1 (P): We could argue that the fleet servers are managed by the hardware guys and DNS by the systems guys, and maybe they are different cultural tribes who don't understand the importance of each other's work. Maybe they don't go to lunch together.

Lens #2 (P): The hardware developers (the agent): are they doing TDD/CD like the EBS devs? Do they have the same theories? Do the EBS guys do CD smoke testing with the hardware monitoring agents?

Lens #3 (P): Tribal understanding of the behavior differences between the hardware guys and the systems guys.

Lens #4 (P): Not understanding customer behavior. Customers increase their actions (API calls): testing services, QA, smoke tests.

Page 38: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s Outage 10/22/2012

33

Let’s Review

# X1: a simple server failure
# X2: the failover
# X3: the DNS
# X4: the memory leak
# X5: bad TDD hygiene by the FMS eng/dev team
# X6: the dreaded throttling
# X7: the customers respond with more usage
# X8: the EBS server failover
# X9: the Twitter effect

# The complexity was masked.
# This was not an X -> Y.
# Too bad they had not read Deming...

Page 39: Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Amazon’s Outage 10/22/2012

33

Let’s Review

X -> Y
