production assurance - cdm media · reactive: intervention here at risk of dealing with the...


© 2013 CPT Global Limited

Independent Experience

CIO Forum

Savoy Hotel 5th December 2013 CPT Global – Production Assurance

Alan Sloan – General Manager, CPT Global (Europe)

Alan Mackenzie – CTO, CPT Global


December 2013 CPT Production Assurance

About CPT Global


• Melbourne

• Sydney

• Canberra

• London

• Munich

• Paris

• New York

• Toronto

• Singapore

• Sao Paulo

Founded in Melbourne 1993, listed 2000 (ASX:CGO)

Over 170 specialist consultants

Annual turnover AUD 40 million+

Operations in five continents

Australia since 1993

Europe since 1998

North America since 2002

South America since 2012

Asia since 2012


What do we do?

CPT's services are based around our core disciplines of

Capacity Planning

Performance Management

Testing Services

MIT

Our key services have been created to provide

IT Cost Reductions and efficiency programs

Production Assurance

Test Effectiveness and Management

Independent client-side consulting


This session and CPT

This session focuses on Production Assurance, its importance in IT estates, and what we have learned.

Our technical consultants are IT fire-fighters specialising in the resolution of performance and stability issues, often when:

The problem has reached crisis point

Internal and vendor efforts to resolve it have failed

Preferably, though, we are engaged as early as possible so that the work can be proactive.

We consequently get exposed to diverse

Problems

Clients and environments

Technology combinations


Production Assurance Background


In over 20 years as a provider of specialist technical services, CPT has worked with a wide range of clients, all trying to address common problems and concerns:

Will my application actually deliver to the expectations of the business stakeholders?

Why is my application no longer meeting its SLAs?

Will my application scale effectively for business growth?

The cost of my application / system is escalating

CPT has been involved in a number of situations where an application is live and not performing, or is still in development/testing and concerns have been raised, leading to our involvement to solve the problems.


Production Assurance - Cause


We find that problems are typically caused by one or more of the following:

Business pressure resulting in

A rush to get new applications and changes into production

Minimal or no design consideration for performance and scalability

Limited understanding of the characteristics of the underpinning technologies

Inadequate non-functional testing (stress, performance, etc.)

Poor capacity planning and modelling

Poor communication between groups

Conflicts internally and externally

Silos of activity

Can be particularly prevalent when either application development or infrastructure services are outsourced

The following slides cover the typical approaches to Production Assurance.


Assurance Framework: Benefits v Cost


[Figure: $ Cost and Risk plotted against the SDLC phases Analysis, Design, Code, Test and Implement, annotated with Reactive, Proactive and Predictive intervention points.]

Without an assurance framework, the cost of mitigating and remedying risks increases progressively through the SDLC.

Reactive: Service Restoration

Intervention here is at risk of dealing with the symptoms and not the root cause(s).

Proactive: Service Assurance

Intervention here at established quality gates will begin to identify risks and mitigation approaches earlier in the SDLC.

Predictive: Risk Avoidance

A mature assurance framework underpinned by established quality gates inserted into the SDLC, with accompanying processes. The objective is to identify each risk, assess its nature and determine its size and impact as early as possible in the SDLC, early enough to implement mitigation solutions prior to implementation.

The later in the SDLC an Assurance Framework is implemented, the greater the investment required and the less sustainable the improvements will be. Bad practices earlier in the SDLC will remain untreated and will continue. Whilst improvements in production will be more immediate, assurance intervention activities will be more intensive, which is likely to slow down the delivery process through increased prevention techniques.

The earlier in the SDLC an Assurance Framework is implemented, the more sustainable the framework becomes. Whilst the ROI may take longer, potential risks can be identified and mitigated earlier.
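As an illustration only, a quality gate of the kind described on this slide could be sketched as a simple check over a risk register. The `Risk` fields, the 1 to 5 impact scale and the `max_open_impact` limit are assumptions made for this sketch; they are not part of CPT's framework.

```python
from dataclasses import dataclass

# SDLC phases as named on the slide, in order.
SDLC_PHASES = ["Analysis", "Design", "Code", "Test", "Implement"]


@dataclass
class Risk:
    description: str
    phase: str            # SDLC phase in which the risk was identified
    impact: int           # hypothetical scale: 1 (low) to 5 (high)
    mitigated: bool = False


def gate_passes(risks, phase, max_open_impact=2):
    """Return True if every unmitigated risk raised up to and including
    `phase` has an impact no greater than `max_open_impact`."""
    cutoff = SDLC_PHASES.index(phase)
    open_risks = [
        r for r in risks
        if SDLC_PHASES.index(r.phase) <= cutoff and not r.mitigated
    ]
    return all(r.impact <= max_open_impact for r in open_risks)


# Hypothetical register entries, for illustration.
risks = [
    Risk("No performance targets in the design", "Design", impact=4),
    Risk("Unindexed lookup in a batch job", "Code", impact=2),
]
design_gate_before = gate_passes(risks, "Design")   # high-impact risk still open
risks[0].mitigated = True
design_gate_after = gate_passes(risks, "Design")    # risk mitigated before the gate
```

Placing the same check at each phase boundary is what lets a risk found at Design be mitigated before it reaches Implement.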


Situation and response

[Figure: chart with axes Performance Risk and Stability Risk, with regions labelled "Firefighting – Reactive" and "Reactive Preventative".]

This is often where CPT gets involved


Fire-Fighting: Service Restoration

Situation:

Reactive: Restore a production service to acceptable levels of performance, stability or cost.

Approach:

Assess – Fact-based analysis, seeking to define the exact nature of the “problem” as experienced by the users

Analysis of the service/application current state and future state goals, including:

The technical environment

Hardware configuration, system software, database configuration

System utilisation metrics covering CPU, memory, storage, network, and database utilisation and activity

Requires access to system, middleware and application logs, etc.

Plan – Intensive triage and diagnosis activities leading to proposed solutions, recommendations and prioritisation of each issue found.

Implement – Implement recommendations through the change management framework.

Review – Post-implementation measurement (% improvement) and review.

Typically an iterative process involving testing and measurement activities in a non-production environment, to mitigate risk by proving the benefit of proposed changes.
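The "% improvement" measurement in the Review step can be sketched as below. The function names are hypothetical, and the sample figures assume the data extract example elsewhere in this deck (4 hours reduced to roughly 4 minutes), where a lower value is better.

```python
def percent_improvement(baseline, after):
    """Percentage reduction relative to the baseline (lower is better)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - after) / baseline * 100.0


def speedup_factor(baseline, after):
    """How many times faster the 'after' measurement is than the baseline."""
    if after <= 0:
        raise ValueError("after must be positive")
    return baseline / after


# Data extract: 4 hours (240 minutes) before, about 4 minutes after.
improvement = percent_improvement(240, 4)   # about 98.3 %
factor = speedup_factor(240, 4)             # 60.0
```

Measuring both runs in the same non-production environment, with the same data volumes, keeps the comparison meaningful.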


Fire-Fighting: Service Restoration

Short-Term Desired Outcomes:

Recommendations identified to address the observed “problems”.

Implementation of “quick win” recommendations through to production, within the governance and change management framework, with each change weighed against the impact of the problems and the strength of the supporting argument for it.

Medium-Term Desired Outcomes:

A report documenting a series of tactical and strategic recommendations that will resolve the current issues and prevent other problems occurring in the future. The recommendations should be categorised by priority, benefit and ease of implementation.

Ensure monitoring and gating processes are in place for future growth and change and, where applicable, that a continuous improvement program is included.
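One way to picture categorising recommendations by priority, benefit and ease of implementation is the sketch below. The scoring scales, the "quick win" thresholds and the scores given to each item are assumptions for illustration; only the item titles echo examples from this deck.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    title: str
    priority: int   # hypothetical: 1 = most urgent
    benefit: int    # hypothetical: 1 (low) to 5 (high)
    ease: int       # hypothetical: 1 (hard) to 5 (easy)


def rank(recs):
    """Order by priority first, then by higher benefit, then by greater ease."""
    return sorted(recs, key=lambda r: (r.priority, -r.benefit, -r.ease))


def quick_wins(recs, min_benefit=3, min_ease=4):
    """Recommendations that are both worthwhile and easy to implement."""
    return [r for r in rank(recs)
            if r.benefit >= min_benefit and r.ease >= min_ease]


recs = [
    Recommendation("Re-design application", priority=2, benefit=5, ease=1),
    Recommendation("Recode SQL", priority=1, benefit=4, ease=4),
    Recommendation("Reconfigure back-up processing", priority=1, benefit=3, ease=5),
]
ranked_titles = [r.title for r in rank(recs)]
quick_win_titles = [r.title for r in quick_wins(recs)]
```

The quick wins feed the short-term outcomes, while the full ranked list forms the body of the medium-term report.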


Examples of Fire-Fighting - Reactive

Upgrading infrastructure produced widely varying response times, with batch runs taking from 40 hours to 147 hours and risking missed end-of-year processing deadlines.

Oracle bug

Recode SQL

Rogue application taking 4 hours to perform a data extract was reduced to less than 4 minutes

Java application design

A new Settlements application was not going to scale effectively. It needed a complete re-design of 75% of the application, which provided a 6-fold improvement in throughput whilst reducing the hardware requirements.


Examples of Fire-Fighting - Reactive

Batch over-runs in a funds trading system of a major European bank. The application had been ported from the IBM mainframe to Unix / Oracle, and the service provider said there was nothing that could be done except to buy more hardware or re-write the application. CPT was asked to review and try to resolve the issues:

Changes recommended to the Oracle database to prevent death by random I/O

Reconfigured back-up processing

Changes made to memory allocation between batch and online processing to exploit the free memory overnight and avoid I/Os

Some minor SQL changes also recommended

Batch times almost halved, ensuring SLAs could be met consistently with a level of contingency; existing resources were better utilised and a hardware upgrade was avoided


Pro-Active: Service Assurance

Situation:

Pro-Active: Part of the overall design/development/testing/tuning lifecycle, to ensure that the application will perform as anticipated once it is implemented into production.

Approach:

Insertion of key resources into capability areas where deficiencies may exist. In past engagements, CPT has typically found expertise gaps across:

Test Environment Management

Test Management

Performance Engineering and Performance Tuning

Stress, Volume and Performance Testing

Capacity Planning

Governance


Pro-Active: Service Assurance


Outcomes:

A combination of services, specific outcomes and artefacts that will attest that the system will perform as expected on the production infrastructure, supporting the anticipated volume of users and transactions, with all known risks documented.

The services and outcomes/artefacts can encompass all or some of the following, depending on the system requirements:

Test Management: Management, co-ordination and oversight, Test Strategy, Test Planning, Stakeholder Reporting

Test Environment Management: Environment Strategy, Test Data Strategy, Environment Build services, Environment Co-ordination, Environment Technical Support

Performance Engineering and Performance Tuning

Stress, Volume and Performance Testing: Strategy, Planning, Scripting, Data Preparation, Execution, Tuning and Reporting

Capacity Planning: Capacity Plan / Infrastructure Capacity Assessment


Predictive: Risk Avoidance

Situation:

Pre-Emptive: Undertake business and technical focused risk assessments of the applications and underlying infrastructure.

Approach:

Ensure application and infrastructure designs are signed off for Performance, Scalability and Stability during the initial planning stages, as well as reviewing key components during development

Expertise must include at a minimum the following capabilities:

Capacity Planning

Architecture

Performance Tuning

Performance, Stress and Volume Testing


Predictive: Risk Avoidance

Outcomes:

A performance sign-off for each step of the SDLC.

Typically a Capacity Plan / Infrastructure Capacity Assessment, depending on the application requirements. Input to the report would draw on current and planned production user and transaction volumes, combined with Stress, Volume and Performance Testing outcomes.

Pro-active identification of infrastructure components at risk, the capacity required, and the timeframes by which the additional capacity should be provisioned.
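A minimal sketch of how a timeframe for additional capacity might be estimated: fit a straight line to utilisation samples and project when a threshold is crossed. The linear-growth assumption, the 80% threshold and the sample figures are illustrative assumptions, not CPT's modelling method; real capacity models would also factor in planned business volumes and test results.

```python
def periods_until_threshold(utilisation, threshold=80.0):
    """Least-squares linear fit over utilisation samples (one per period).

    Returns the number of periods after the last sample at which the fitted
    line reaches `threshold`, or None if there is no upward trend.
    """
    n = len(utilisation)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(utilisation) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, utilisation))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    # Clamp at zero: the threshold may already have been reached.
    return max(0.0, crossing - (n - 1))


# Hypothetical monthly peak CPU utilisation (%) for one component.
monthly_cpu = [50, 52, 54, 56, 58, 60]
months_left = periods_until_threshold(monthly_cpu)   # about 10 months
```

Comparing the result with hardware procurement lead times gives the date by which provisioning should start.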


Situation and response

[Figure: chart with axes Performance Risk and Stability Risk, showing four responses: Service restoration; Application Assessment & Tuning; Monitoring & Design/Change considerations; Availability Management & Planning.]


Case Study – Fire-Fighting to Pre-emptive

Background – Major Telco in North America

Introducing a Sales and Service Portal (SSP), planned to replace all existing portals for shops, booths and home customers, covering wireless, cable, etc.

5 pilot users in a couple of stores were using the new product, but not under any stress

Feedback on user experience was very poor, with dreadful response times, yet the portal was supposed to scale initially to 10,000+ users

Management confidence very low

Two compelling events

iPhone 5C launch, September 2013: high demand for online activation

Back to school 2013 ('Cable Christmas'): high demand for cable as mums and dads pay for college students' cable in their new student accommodation


Case Study – Fire-Fighting to Pro-Active

Working backwards over a period of 6 months

Identify and fix immediate performance problems and get the pilot to run better

Put in place SLAs to ensure future success criteria were established

First increase in confidence levels

Root cause analysis of performance problems

Performance testing leading to:

Capacity planning and modelling to assess scalability to 10,000 users

Recommendations

Applications design

Infrastructure improvements

Intensive monitoring of SSP through a 'war room' over 'Back to School 2013'. SSP processed its highest volumes ever

iPhone launch: similar approach, with a successful, uninterrupted launch

Embedded a coherent approach encompassing ongoing performance tuning, capacity planning, embargo planning, incident management, production monitoring and command centre management
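For the capacity planning and modelling step in the case study above, a first-cut scalability estimate can use two standard queueing relationships: the interactive response time law, X = N / (R + Z), and Little's Law, L = X × R. The 10,000 user figure comes from the case study; the response time and think time below are assumed values for illustration only, and this is not a description of the modelling CPT actually performed.

```python
def required_throughput(users, response_time_s, think_time_s):
    """Interactive response time law: X = N / (R + Z), in requests per second."""
    cycle = response_time_s + think_time_s
    if cycle <= 0:
        raise ValueError("response time plus think time must be positive")
    return users / cycle


def requests_in_flight(throughput, response_time_s):
    """Little's Law: average number of requests being served, L = X * R."""
    return throughput * response_time_s


# 10,000 users (from the case study); 2 s response and 30 s think time are assumptions.
x = required_throughput(users=10_000, response_time_s=2.0, think_time_s=30.0)
in_flight = requests_in_flight(x, response_time_s=2.0)
```

With these assumed timings the portal would need to sustain about 312.5 requests per second with roughly 625 requests in flight, which is the kind of target that performance testing can then confirm or refute.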


Flow-on Effects from Production Assurance


Some of the flow-on effects of moving towards more pro-active Production Assurance include:

Cost reductions through immediate and ongoing consumption reduction

Depending on the commercial structures in place (in-sourced / out-sourced, MLC, etc.)

Cost avoidance through reducing or negating the need for hardware upgrades

Lower consumption means a smaller footprint is required as the application is rolled out or business volumes increase

Improved scalability costs

Faster testing cycles

Better-managed test environments mean less hardware and faster testing turnaround

Better-focused testing and analysis of results

Improved relations with business sponsors

SLAs are clearly defined and monitored to ensure they are being met; if they are at risk, the problem is tackled pro-actively before it impacts the business
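The idea of spotting an SLA that is "at risk" before it is breached can be sketched as a percentile check with a warning band. The nearest-rank percentile, the 95th percentile target and the 80% warning ratio are assumptions chosen for this sketch, not values from the presentation.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def sla_status(samples, sla_seconds, pct=95, warn_ratio=0.8):
    """Classify response times against an SLA as 'ok', 'at risk' or 'breached'."""
    observed = percentile(samples, pct)
    if observed > sla_seconds:
        return "breached"
    if observed > warn_ratio * sla_seconds:
        return "at risk"
    return "ok"


# Hypothetical response times in seconds against a 2-second SLA.
healthy = [1.0] * 20
drifting = [1.0] * 18 + [1.7, 3.0]
failing = [1.0] * 18 + [2.5, 3.0]
statuses = [sla_status(s, sla_seconds=2.0) for s in (healthy, drifting, failing)]
```

The "at risk" band is what gives the team time to act before the business is affected.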


Questions ?


Production Assurance – CPT Service Lines

Production Assurance Services Legend: T = Tactical, S = Strategic, O = Organisational

IT Management Consulting:

ICT Strategic Planning (O)

ITIL Implementation (S, O)

IT Effectiveness and Change (O)

Business Measurements of IT Services (O)

Selective Sourcing Strategies (O)

Enterprise Architecture (S, O)

Program and Project Management (S)

Program and Project Lifecycle Reviews (S, O)

Business Case Development and Reviews (S)

Development of Business Requirements for Market Testing (S)

Package Assessments and Evaluations (S)

Troubled Project Reviews and Remediation (S, O)

Capacity & Performance Services:

Capacity Planning

Capability Reviews (S, O)

Frameworks and roadmaps (S, O)

Reviews and assessments (T)

Plans: Application / Infrastructure / Enterprise (T)

Modelling and forecasting (T)

As a service / on demand / hybrid (T, S, O)

“Cost of Running” modelling and reporting (T)

Performance Management (T, S, O)

Performance Tuning Services

Application / infrastructure specific (T)

Corporate wide programmes (S, O)

Performance Diagnosis (T)

Technical Support Resources (T)

Database Administrators

Systems Programmers

Systems Administrators

Analyst/Programmers

Architecture Services (T, S)

Technical Architecture and Design Reviews

Technical Test Environment Support and Build Resources (T, S)

Testing Services:

Capability and Effectiveness

Testing Capability and Maturity Reviews (S, O)

Testing Lifecycle Effectiveness (S)

Test Strategies (S)

Service Validation and Testing (S)

Test Data Extraction, Reduction and Masking (S)

Testing Tools and Techniques (S)

Test Coverage (S)

Test Reporting and Metrics (S)

Operational

Test Management (T)

Stress and Volume/Performance Testing (T)

Non-functional Testing (T)

Specialist Test Execution (T)

Test Automation Techniques, Strategies and Execution (T, S)

Test Environment Management and Optimisation (T, S)

Test Data Extraction, Reduction and Masking (T, S)


About CPT Global

CPT Global is a specialised consultancy with two focus areas. Its Technical Consulting services enhance the control, quality, stability, efficiency and reliability of all technology platforms with core offerings of Capacity Planning, Performance Management and Testing. Its Management Consulting services review and improve the business processes associated with Information Technology, with offerings that include Program and Project Management, IT Governance Reviews, Strategic Sourcing Strategies and Technology Transition Planning.

AUSTRALASIA

Melbourne

Level 1, 4 Riverside Quay

Southbank VIC 3006

Telephone +61 3 9684 7900

Facsimile +61 3 9684 7999

Sydney

Suite 3, Level 5, 80 Clarence St

Sydney, NSW 2000

Telephone +61 2 8234 7400

Facsimile +61 2 8234 7499

Canberra

Level 4, 161 London Circuit

Canberra City ACT 2601

Telephone +61 2 6206 9700

Facsimile +61 2 6206 9799

Singapore

10 Anson Road, #32/15

International Plaza,

Singapore 079903

Telephone: +65 6226 2555

AMERICAS

New York

410 Park Avenue, 15th Floor,

New York NY 10022

Telephone +1 917 210 8668

Facsimile +1 917 210 8182

Toronto

100 King Street West, Suite 3700

Toronto, Ontario M5X 1C9

Telephone +1 416 642 2886

Facsimile +1 416 644 8801

Sao Paulo

Al. Europa 1206 – Santana de

Parnaiba - SP – CEP 06543-325

Telephone: +55 11 8454-0869

EUROPE

London

Parkshot House, 5 Kew Road

Richmond, Surrey, TW9 2PR

Telephone +44 20 8334 8085

Facsimile +44 20 8334 8541

Munich

Landsberger Str. 302

D-80687 Munich

Telephone +49 89 9040 5955

Facsimile +49 89 9040 5965

Paris

140 bis rue de Rennes

75006 Paris

Telephone +33 1 70 38 23 21

Facsimile +33 1 70 38 23 00

www.cptglobal.com
