Peter Holditch: DevOps
DESCRIPTION
The "DevOps pay-rise" presentation I gave at tcube on 18th September 2014
TRANSCRIPT
Realising the true value of DevOps
The DevOps Payrise
@pholditch
Peter Holditch
Senior Sales Engineer
DevOps?
Developers working together with
Operations to get things done faster in an
automated and repeatable way
DevOps Success?
Typical Dev Day
1. Look at the overnight integration tests
2. Buy chocolates for the team if you broke the build
3. Scramble to fix the build
4. Pick the top priority item from your backlog
5. Start coding
6. Get dragged into troubleshooting prod. incidents
7. Hastily check in new code as you ran out of time
What do developers care about?
Learn
Innovate
Eat Pizza
What does development really care about?
What did the Business care about?
£
Features = £
Even though the business never measured it.
“Everything is fine from our end.”
OPS:
Typical Ops Day
1. Open 30 new tickets
2. Make 200 phone calls
3. Attend executive P1 status update meeting
4. Argue about what a P1 and P2 really is
5. Reprioritise P2 tickets to P1
6. Reprioritise P3 tickets to P2
7. Close tickets as ‘Cannot reproduce’ or ‘Duplicate’
What do operators care about?
P1s
SLAs
What does operations really care about?
What did the Business care about?
£
P1 = £
Even though the business could never prove it.
How the Business often view dev & ops
How L2 & L3 Support often view dev & ops
False Alarms
Site is down
404 Errors
My search is slow
2am Friday - #FFS
We have had an alert that the load on one of your staging servers is critical.
How much time do false alarms waste?
Role      Hours Per Week   Cost Per Week   Cost Per Year
Ops       20               £400            £20,800
L2        10               £200            £10,400
L3        15               £300            £15,600
Hosting   6                £120            £6,240
Network   6                £120            £6,240
CMS       10               £200            £10,400
Total     67               £1,340          £69,680
Conservative estimates assuming £20/hour
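The arithmetic behind these figures can be reproduced with a short sketch. It assumes the slide's flat £20/hour rate and a 52-week year; the names `HOURS_PER_WEEK` and `false_alarm_cost` are illustrative and do not come from the presentation.

```python
# Weekly hours lost to false alarms per role, as listed on the slide.
HOURS_PER_WEEK = {
    "Ops": 20,
    "L2": 10,
    "L3": 15,
    "Hosting": 6,
    "Network": 6,
    "CMS": 10,
}
HOURLY_RATE_GBP = 20  # the slide's conservative assumption
WEEKS_PER_YEAR = 52   # assumption: no allowance for holidays


def false_alarm_cost(hours_per_week, rate=HOURLY_RATE_GBP, weeks=WEEKS_PER_YEAR):
    """Return (total_hours, cost_per_week, cost_per_year) for the given roles."""
    total_hours = sum(hours_per_week.values())
    per_week = total_hours * rate
    return total_hours, per_week, per_week * weeks


hours, per_week, per_year = false_alarm_cost(HOURS_PER_WEEK)
print(f"{hours} h/week, £{per_week:,}/week, £{per_year:,}/year")
```

The per-role hours sum to 67 per week, which at £20/hour gives the £1,340 weekly and £69,680 yearly totals.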
How much revenue did the business lose?
No idea
Typical Day
Ops
1. Open 30 new tickets
2. Make 300 phone calls
3. Attend executive P1 status update meeting
4. Argue about what a P1 and P2 really is
5. Reprioritise P2 tickets to P1
6. Reprioritise P3 tickets to P2
7. Close tickets as ‘Cannot reproduce’ or ‘Duplicate’
Dev
1. Look at the overnight integration tests
2. Buy chocolates for the team if you broke the build
3. Scramble to fix the build
4. Pick the top priority item from your backlog
5. Start coding
6. Get dragged into troubleshooting prod. incidents
7. Hastily check in new code as you ran out of time
Things that would help
1. Automation
2. Collaboration
3. Better Tooling
4. Business Metrics
Things that could justify them
1. Baseline the starting point
2. Measure progress
3. Calculate Business Impact
4. Promote success not problems
5. Demonstrate value
Modern-day User Expectations…
3 billion daily transactions
250 milliseconds
500+ updates/yr
Spot the App…
1 million+ servers
100 million GB
1,000 man years
1,500 miles
Konstantin Karpov
User Expectations
Napkin architecture…
Internet -> Firewall -> Load Balancer -> Web server 1 / Web server 2 -> Database
Key:
= bad
= not bad
Pre-Production APM: “Non-Production Data”
Development Operations
Dev Test Staging Live
Monitor & Manage Profile QA Load Test
Pre-Production Production
Production APM: “Production Data”
Development Operations
Dev Test Staging Live
Monitor & Manage
Pre-Production Production
Profile QA Load Test
tools can be helpful
right tools
right hands
right use
How much time and £ do these tools save?
INFRASTRUCTURE AUTOMATION
How much time and £ do these tools save?
DEPLOYMENT AUTOMATION
How much time and £ do these tools save?
LOG AUTOMATION
LogStash
Monitoring
How much time and £ do these tools save?
severe outage?
PLAN FOR FAILURE!
be stronger than the weakest link
Traditional monitoring approach is limited
APPLICATION
BUSINESS TRANSACTION
Server
OS DB
MQ
Web
JVM
Silo’d domain visibility
EXISTING APPROACH
EXPANDED APPROACH
Business transaction
99.9% 99.9% 99.9% 99.9%
END USER EXPERIENCE
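Why four healthy-looking silos can still disappoint the end user is easier to see with numbers. The sketch below assumes the business transaction passes through four components in series, each 99.9% available, with independent failures; that reading of the slide and the function name `end_to_end_availability` are assumptions, not something the presentation states.

```python
def end_to_end_availability(component_availabilities):
    """Multiply per-component availabilities (as fractions) for a serial chain."""
    result = 1.0
    for availability in component_availabilities:
        result *= availability
    return result


# Four silos, each reporting 99.9% (assumed to be in series and independent).
tiers = [0.999, 0.999, 0.999, 0.999]
overall = end_to_end_availability(tiers)
downtime_min_per_year = (1 - overall) * 365 * 24 * 60

print(f"{overall:.2%} end to end")
print(f"about {downtime_min_per_year:.0f} minutes of downtime per year")
```

Under these assumptions the transaction as a whole is available roughly 99.6% of the time, which is lower than any single silo reports.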
How many of you use performance
management tools?
Identify early
Troubleshoot fast
Resolve quickly
Quantify impact
x
FOCUS
Big data is BAD
Big monitoring data is BAD
Keep Everything?
Keep Nothing?
just what you need
MONITORING ENVIRONMENT
servers 92
cores 700
storage 80TB
8%
IT ENVIRONMENT
servers 1,200
trans/min 300,000
smart data
actionable, intelligent, information
IS THIS PERSON PERFORMING WELL?
Blood pressure 165/100
Heart rate 150bpm
are we talking about this person?
OR this person?
Attribute                Person 1   Person 2
Heart Rate               150        150
Blood Pressure           180/90     180/90
Eye Color                Blue       Brown
Blood Type               O+         O-
White Blood Cell Count   3.5        3.8
Hair Color               Brown      Blue
Height                   180cm      175cm
Shoe size                11         10
Weight                   180kg      94kg
Current activity         sitting    skating
What data could we collect?
IS PERSON 2 PERFORMING WELL?
Time 12min 44sec
Distance 10,000 metres
Record time: 12min 58sec (baseline)
New Olympic Record: Jorrit Bergsma, 10,000m winner
average response time with historical baseline
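One simple way to judge a response time against its history, as the skating example judges a time against the record, is sketched below. It assumes a plain mean-plus-three-standard-deviations rule over hypothetical sample data; it is not how any particular monitoring product computes its baselines, and `is_anomalous` is an illustrative name.

```python
from statistics import mean, stdev


def is_anomalous(history_ms, current_ms, threshold_sigmas=3.0):
    """Flag current_ms if it exceeds the historical mean by more than
    threshold_sigmas sample standard deviations."""
    baseline = mean(history_ms)
    spread = stdev(history_ms)  # needs at least two historical samples
    return current_ms > baseline + threshold_sigmas * spread


# Hypothetical historical response times, in milliseconds.
history = [100, 110, 95, 105, 90, 100]

print(is_anomalous(history, 104))  # close to the usual range
print(is_anomalous(history, 250))  # far above the baseline
```

A raw number such as 250 ms means little on its own; it only becomes actionable once compared with what is normal for that transaction.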
User & IT perspective
Analytics
Correlation
Intelligent alerting
Resolution path
monitoring platforms should do the heavy lifting
Don’t be this person…
plan ahead
anticipate needs
intended purpose
And remember: Monitoring is not all traffic lights…
Understand the impact of slow performance
* Screenshot from US e-Commerce AppDynamics Customer
Application Revenue
Application Errors
Application Response time
$64,499 per min
$11,987 per min
10.1 s
100 ms
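The revenue impact of a slowdown can be estimated from the two per-minute figures. This sketch assumes that $64,499 per minute corresponds to the 100 ms response time and $11,987 per minute to the 10.1 s response time, a pairing the transcript does not state explicitly; the 30-minute duration and the function name `revenue_shortfall` are hypothetical.

```python
def revenue_shortfall(normal_per_min, degraded_per_min, minutes):
    """Revenue not earned while the application runs in the degraded state."""
    return (normal_per_min - degraded_per_min) * minutes


NORMAL_USD_PER_MIN = 64_499    # from the slide; assumed to be the ~100 ms state
DEGRADED_USD_PER_MIN = 11_987  # from the slide; assumed to be the ~10.1 s state

per_minute = revenue_shortfall(NORMAL_USD_PER_MIN, DEGRADED_USD_PER_MIN, 1)
half_hour = revenue_shortfall(NORMAL_USD_PER_MIN, DEGRADED_USD_PER_MIN, 30)

print(f"${per_minute:,} per minute")
print(f"${half_hour:,} over a 30-minute slowdown")
```

Expressing an incident this way gives the business a figure to weigh against the cost of tooling and automation.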
Understand the benefit of an application release
Application Revenue
Application Response time
code release 1
code release 2
code release 3
$44,499 per min
$58,237 per min
1.9 s
3.1 s