Vol. 4, No. 11 November 2013 ISSN 2079-8407 Journal of Emerging Trends in Computing and Information Sciences

©2009-2013 CIS Journal. All rights reserved.

http://www.cisjournal.org

857

Review of IT Control Chart

Igor Trubin, PhD, Vice Chair of Southern CMG

ABSTRACT

The Control Chart is one of the main Six Sigma tools for optimizing business processes. After some adjustments, it is now used as a visualization tool in IT Capacity Management, especially in “behavior learning” products, to highlight performance and capacity usage anomalies. This review answers the following questions. What is the Control Chart, how should it be read, and where should it be used? Which performance tools use it? What types of control charts exist (MASF charts vs. classical SPC), and what is the IT-Control Chart for IT application performance control? How can a Control Chart be built using Excel for interactive analysis and R scripting for automation?

Keywords: Control Chart, Six Sigma Tools, SPC

1. INTRODUCTION

One of the most powerful ways to visualize systems behavior is using the Control Chart. Recently this statistical tool has become popular in the latest generation of capacity and availability management tools. In this paper the following items related to Control Charts are discussed:

- Where and why the Control Chart is used, with a review of some systems performance tools on the market that build and use control charts.

- What is the Control Chart?

- A little bit of theory and history.

- How SETDS [2] (Statistical Exception and Trend Detection System) uses it.

- MASF vs. SPC (Shewhart) charts.

- The new IT-Control Chart concept [8], the best control chart type for IT data visualization.

- A sizable display of charts already published in www.CMG.org papers, plus some new ones, with explanations of how to read them and how they complement other types of charts.

- How to build a Control Chart: using Excel for interactive analysis and R [7] to automate control chart generation, with a demonstration of coding and technique.

Fig 1: Examples of Control Charts Used in Capacity and Availability Management

2. WHERE AND WHY THE CONTROL CHART IS USED IN IT

Control charts have been used by several System Management tool vendors as a main or an auxiliary visualization sub-tool. Some general-purpose statistical tools (SAS, JMP and others) can build some types of control charts, interactively or programmatically. Those are used for ad-hoc reporting or for building “home-made” systems, such as SEDS/SETDS (Statistical Exception and Trend Detection System), where control charting (the IT version) is used as the main reporting tool [2, 3, 4, 5].

Other products have built-in control chart generators as a reporting feature. Figure 1 shows some examples of control charts that are used in different products (tools) and were published in some CMG papers and presentations. The latest generation of Capacity and Availability Management tools must have control charts in order to be truly proactive.

More and more vendors have realized this and have been adding the feature to their products, especially in the most modern system management tools of the behavior learning (or self-learning) type. That is why any Capacity Management specialist must be familiar with the concept of Control Charts.

Why are control charts used for Capacity and Availability Management?

- A Control Chart can uncover hidden trends and patterns in systems performance data;

- A Control Chart is a truly proactive tool that can capture unusual resource usage before something breaks [3];

- A Control Chart is one of the best baselining tools and can show how actual data deviate from a static or dynamic historical baseline;

- A Control Chart provides a dynamic threshold: there is no need for manual threshold settings;

- A Control Chart can detect workload pathologies (run-aways, memory leaks and others).

3. WHAT IS THE CONTROL CHART

Definitions:

- The control chart, also known as the Shewhart chart or process-behavior chart, is a tool used in statistical process control (SPC) to determine whether a manufacturing or business process is in a state of statistical control.

- It is a graphical tool for monitoring changes that occur within a process, by distinguishing variation that is inherent in the process (common cause) from variation that yields a change to the process (special cause). This change may be a single point or a series of points in time; each is a signal that something is different from what was previously observed and measured.

Fig 2: Classical Control Chart and Histogram Showing Outlier Event of CPU utilization

A control chart usually shows a metric plotted against time, with a centerline, upper control limit (UCL) and lower control limit (LCL) superimposed on that chart. The purpose of the control limits is to indicate the range within which the process is in control. There are several common choices for the upper and lower control limits.

Choice of limits (as shown on Figure 2):

UCL = Mean + 3σ
Centerline = Mean (or Average)
LCL = Mean - 3σ

where σ is the standard deviation. A control limit of 3σ balances the risk of error because, for normally distributed data, data points fall inside the 3σ limits 99.7% of the time when a process is in control.
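As a minimal sketch of the 3σ choice above, the following Python fragment computes the three lines of a classical chart from a list of measurements. The function name `sigma_limits`, the sample readings and the use of the sample (rather than population) standard deviation are illustrative assumptions and are not taken from the tools reviewed in this paper.

```python
from statistics import mean, stdev

def sigma_limits(samples, k=3.0):
    """Return (LCL, centerline, UCL) as mean -/+ k standard deviations."""
    center = mean(samples)
    sigma = stdev(samples)  # sample standard deviation (an assumption)
    return center - k * sigma, center, center + k * sigma

# Hypothetical CPU utilization readings, in percent.
cpu = [40, 42, 38, 41, 39, 40, 43, 37]
lcl, center, ucl = sigma_limits(cpu)
print(f"LCL={lcl:.1f} CL={center:.1f} UCL={ucl:.1f}")  # LCL=34.0 CL=40.0 UCL=46.0
```

A new reading is then treated as an exception when it falls below the LCL or above the UCL.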

Other choice of limits:

UCL = 95th Percentile
Centerline = 50th Percentile
LCL = 5th Percentile

A percentile, or centile, is the value of a variable below which a certain percent of observations fall. That choice is good if the data is far from a normal distribution.
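The percentile-based limits can be sketched in the same way. The fragment below uses `statistics.quantiles` from the Python standard library; the function name `percentile_limits`, the interpolation method and the sample data are assumptions made only for illustration.

```python
from statistics import quantiles

def percentile_limits(samples, low=5, high=95):
    """Return (LCL, centerline, UCL) as the low, 50th and high percentiles."""
    # quantiles(..., n=100) yields 99 cut points: cuts[i - 1] is the i-th percentile.
    cuts = quantiles(samples, n=100, method="inclusive")
    return cuts[low - 1], cuts[49], cuts[high - 1]

# Hypothetical, strongly skewed response times in milliseconds.
response_ms = [12, 13, 13, 14, 15, 15, 16, 18, 25, 90]
lcl, center, ucl = percentile_limits(response_ms)
print(lcl, center, ucl)
```

Because no assumption of symmetry is made, a long right tail widens only the upper limit, which is usually what is wanted for skewed metrics.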

Special Types of Control Charts. There are X-bar, R, S, U, Np, P and C Control charts.

X-bar is the most common control chart used in Capacity Management. In this chart the sample means are plotted in order to control the mean value of a variable.

The C-control chart (Poisson or Counts) plots the number of defects and is sensitive to changes in the number of defects in the measurement process. In our field, it could be used to control workload pathologies (e.g. run-aways, memory leaks and so on; see the example on Figure 3). For the C-chart the control limits are calculated as:

LCL = c - 3√c
UCL = c + 3√c

where c is the mean number of defects. Also, zero serves as a lower bound on the LCL.
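The C-chart formulas translate directly into code. In this sketch the function name `c_chart_limits` and the daily counts are hypothetical; only the formulas themselves come from the text above.

```python
from math import sqrt
from statistics import mean

def c_chart_limits(counts):
    """Return (LCL, centerline, UCL) for a C-chart, with the LCL floored at zero."""
    c = mean(counts)
    spread = 3 * sqrt(c)
    return max(0.0, c - spread), c, c + spread

# Hypothetical daily counts of run-away processes on one server.
daily_counts = [4, 3, 5, 4, 6, 2, 4]
lcl, center, ucl = c_chart_limits(daily_counts)

# A new day with 12 run-aways exceeds the upper limit.
print(12 > ucl)
```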

Fig 3: C-Control Chart Example

Other types are more appropriate for the mechanical engineering field.

4. MASF, SPC CONTROL AND HISTOGRAM CHARTS COMPARISON

As opposed to the classical X-bar univariate control chart (Figure 2), a Multivariate Adaptive Statistical Filtering (MASF) chart (as described in [1]) can be most useful for showing daily or weekly profiles of resource usage and exceptions. It is actually a multivariate Control Chart. It could also be treated as a 2D cut (a projection to the most recent day or week) of a 3D structure (a set or group) of classical control charts, as shown on Figure 4.

Fig 4: 3D Model of the MASF Chart

The relationship between classical SPC and MASF control charts is demonstrated in Figure 5.

Fig 5: MASF, SPC Control and Histogram Charts Comparison

NOTE: Limits might need to be cut at the natural thresholds of 0% or 100% because of the nature of the Utilization metric!
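The cut mentioned in the note is a one-line operation. The helper name `clip_limits` and its default bounds are assumptions for a metric expressed in percent.

```python
def clip_limits(lcl, ucl, floor=0.0, ceiling=100.0):
    """Cut control limits at the natural bounds of a utilization metric."""
    return max(lcl, floor), min(ucl, ceiling)

# Mean 52% with a wide spread gives limits outside the physical range.
print(clip_limits(-4.5, 108.2))  # (0.0, 100.0)
```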

The choice of control limits depends on how close the sample data is to a normal distribution. If it is close, the standard deviation can be used; if it is not, use percentiles. Figure 6 shows how that analysis could be done visually by plotting histograms.

Fig 6: Performance Data Normality Visual Test (For the server global CPU utilization MASF chart)

Basically, MASF charts show graphically the baseline statistics for every data group (by hours and/or weekdays) on top of the actual data. The baseline, which is called a reference set, could be static (a fixed range of days in history) or dynamic (some number of days that follows the actual data, just like a moving average).

A static baseline makes more sense for a non-production environment; otherwise test or development activities would naturally be statistical exceptions and outliers. A dynamic baseline is good for a stable production environment. NOTE: in some observed cases where the data is not normally distributed, subgroups of the same data (e.g. by hour) can be normally distributed.
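The grouping idea behind a 24-hour MASF profile can be sketched as follows. This is an illustration under stated assumptions, not the SEDS implementation: the names `reference_set` and `masf_baseline`, the data layout (one list of `(hour, value)` pairs per day) and the mean ± kσ limits are all hypothetical choices.

```python
from collections import defaultdict
from statistics import mean, stdev

def reference_set(days, window=None):
    """days: per-day lists of (hour, value) pairs, oldest day first.

    window=None keeps the whole supplied range (static baseline);
    an integer keeps only the most recent `window` days (dynamic baseline).
    """
    chosen = days if window is None else days[-window:]
    return [pair for day in chosen for pair in day]

def masf_baseline(reference, k=3.0):
    """Return {hour: (LCL, centerline, UCL)}, one limit set per hourly group."""
    groups = defaultdict(list)
    for hour, value in reference:
        groups[hour].append(value)
    limits = {}
    for hour, values in groups.items():
        center = mean(values)
        sigma = stdev(values) if len(values) > 1 else 0.0
        limits[hour] = (center - k * sigma, center, center + k * sigma)
    return limits
```

Calling `masf_baseline(reference_set(days, window=28))` every night would give a baseline that follows the actual data, while a fixed slice of `days` gives a static one.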

5. MAIN TYPES OF CHARTS AGAINST PERFORMANCE DATA

- The classical SPC type of chart (against daily or hourly aggregated or raw granular data) is shown on Figures 2 and 3.

- 24-hour profile MASF charts for global or application level data [2] are good for correlation analysis, as seen on Figure 10, where the application level chart being in synch with the global data chart strongly suggests that some particular application exception (anomaly) caused a global exception.

Fig 7: Weekly Control Chart against Daily Data Points

- Monthly, bi-weekly or just weekly profile MASF charts of daily data are shown on Figure 7 (weekly for daily data) and Figure 9 (monthly for daily data).

Fig 8: IT-Control Chart

- And finally, the most powerful is the IT-Control Chart (first introduced by this author [8]), which is the weekly profile of hourly data, showing historical baseline statistics on top of the most recent week’s actual hourly data points (Figure 8). That is the main SEDS [2,9] graphical tool, where an exception is considered detected if the actual data (black on Figure 7) goes outside the control limits (UCL is red and LCL is blue on Figure 7). The IT-Control Chart is one of the most effective ways to visualize system performance.

Fig 8a: IT- and EV-Control Charts

- Plus there is a variation of the IT-Control Chart called the EV-Control Chart, shown on Figure 8a and introduced in [10], where EV is the Exception Value (magnitude of an anomaly) meta-metric [2], marked as a red zone on Figure 9 as an example.

Fig 9: Monthly Control Chart against Daily Data

Fig 10: Synchronized MASF 24-hour profile Control Charts: Application vs. Global CPU usage

The Control chart is one of the possible graphical tools, but other types of charts are broadly used, such as run charts, trend-forecast charts and bar charts. The Control chart (especially the IT version) can tell almost everything that is required to understand what is going on with a particular subsystem or metric, now and historically. But to make it clear and understandable for customers and for management, it is highly recommended to use other types of charts in conjunction with the control charts.

Figure 11 shows the case where SEDS reports an exception of unusually low CPU usage for some ESX server, starting with the Pareto type of bar chart with a link to the main chart (the control chart) along with the trend-forecast charts [3].

Looking at all three charts, it becomes clear that some portion of the server capacity was released because one virtual server was moved to another host. By the way, sometimes additional trend charts can be built manually to provide some details. For instance, Figure 12 shows the CPU utilization of one particular VM server across three different hosts as a result of using V-motion type management software.

Fig 11: Bar, Control and Trend Forecast Charts Usage Examples: 3 Different Views on the Same Issues

Fig 12: CPU Utilization of VM Across Several Hosts Due to Migrating via V-motion

Process level data charts could also be used together with the control charts to find which particular process is responsible for some unusual spikes. Figure 13 shows how the chart of CPU usage by processes can be used to explain that the incremental daily back-up causes small daily spikes on the control chart of network traffic (NIC level). The full back-up process caused one big spike per week that extended into work hours, where it could dangerously interfere with the other DB2 on-line workload on that server.

Fig 13: Global Network Data IT-Control Chart in Correlation with Process Level CPU usage Chart

6. IT-CONTROL CHART CONCEPT

Is it possible to show the most current metric data, and make some prediction of what to anticipate today-tomorrow, but also show data in retrospect, all in the same picture?

The simple radar in a plane or ship cockpit refreshes current data on top of the most recent ones and shows the approaching "future". That is a good model of what is needed to organize a control chart!

The “IT-Control Chart”, which was recently introduced at the CMG conferences [3], works in a similar way and uses a border line to separate current data from the most recent past data.

It also provides a historical baseline to show what can be expected and for comparison purposes. It is organized just like an Outlook full-week calendar to clearly show the weekly pattern from Sunday to Saturday. The refresh rate could be as follows.

- Day: every morning the border shifts 24 hours, and yesterday’s updated data appears to the left of the border with the other days of the current week. To the right of the border, last week’s weekdays are seen, starting with the same weekday as today. Good for system performance management.

- Hour: the control chart is refreshed hourly and the border moves every hour (see Figure 14). Good for near-real-time monitoring and proactive availability management [7].

- Minutes? Or seconds? Just like the real radar is refreshed? Possible, but this would require a lot of overhead, which could be a capacity problem. Some modern monitoring products use this rate and can predict and alert a few hours ahead of a severe issue, before users notice any performance degradation!

Fig 14: IT-Control Chart Works like Radar and Full Week Calendar View

How to construct an IT-Control Chart from date-time stamped performance data? To build the weekly chart against hourly data, do the following three steps, as shown on Figure 15:

- Step 1: take one week’s worth of recent data;

- Step 2: put that in a weekly profile form;

- Step 3: take some representative historical reference data and set that as a baseline, then compare it with the most recent actual data by plotting control limits.
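The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `hour_of_week` and `it_control_chart`, the input layout (a list of `(datetime, value)` hourly samples) and the mean ± kσ limits are hypothetical and do not reproduce the SEDS code or the R script referenced in [7].

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev

def hour_of_week(ts):
    """Map a timestamp to a weekly-profile slot 0..167, Sunday 00:00 being slot 0."""
    days_from_sunday = (ts.weekday() + 1) % 7  # Python counts Monday as 0
    return days_from_sunday * 24 + ts.hour

def it_control_chart(records, now, k=3.0):
    """records: (datetime, value) hourly samples; now: end of the actual week.

    Returns {slot: (actual, LCL, centerline, UCL, is_exception)}.
    """
    week_start = now - timedelta(days=7)
    actual = {}                    # Steps 1 and 2: last week in weekly-profile form
    baseline = defaultdict(list)   # Step 3: historical reference data per slot
    for ts, value in records:
        if ts >= now:
            continue
        slot = hour_of_week(ts)
        if ts >= week_start:
            actual[slot] = value
        else:
            baseline[slot].append(value)
    chart = {}
    for slot, value in actual.items():
        history = baseline.get(slot, [])
        if len(history) < 2:
            continue  # not enough history to estimate limits for this slot
        center, sigma = mean(history), stdev(history)
        lcl, ucl = center - k * sigma, center + k * sigma
        chart[slot] = (value, lcl, center, ucl, value < lcl or value > ucl)
    return chart
```

Re-running `it_control_chart` with a later `now` moves the border between actual and baseline data, which corresponds to the refresh described in this section.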

Fig 15: Three Steps of Building IT-Control Chart

Repeat for every refresh interval: nightly for a one-day rate, hourly for a one-hour rate and so on. For instance, Figure 16 shows how the refreshing border would jump the next day, showing new exceptions.

Fig 16: Final and Repetitive Step of Refreshing the IT-Control Chart

Other possible variations of IT-Control Charts are as follows.

Non-jumping refreshing border:

The jumping border may confuse users, so it could be constantly kept at the right edge of the picture as shown on Figure 17. The second chart shows the 24-hour shift of all days with data to the left, which is needed to keep the last available data at the far right side.
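The difference between the two layouts comes down to the order of the days on the horizontal axis. A small sketch, with the hypothetical helper `right_edge_order`, shows the 24-hour shift that keeps the last available day at the far right:

```python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

def right_edge_order(last_day):
    """Order the weekdays so that the day with the latest data is rightmost."""
    i = DAYS.index(last_day)
    return DAYS[i + 1:] + DAYS[:i + 1]

print(right_edge_order("Tue"))
# ['Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue']
```

With the jumping border, the axis would always stay in the calendar order of `DAYS` and only the position of the border would change.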

Fig 17: IT-Control Chart without Refreshing Border

The disadvantage of this approach is that the weekly pattern is not obviously seen. That way it would look like an oscillogram or a cardiogram rather than radar.

Near-Real Time IT-Control Chart.

That requires processing data and rebuilding the chart every interval (at least hourly) and would look like the one on Figure 18.

Fig 18: Near-Real Time IT-Control Chart

7. HOW TO BUILD A CONTROL CHART USING A SPREADSHEET AND R

Any control chart could be built just using a spreadsheet with simple formulas. At least one example was already published in the CMG paper [4] and Figure 19 shows the final view of that “Control Chart Builder” (the formulas can be found in that paper).

Fig 19: Control Chart Builder Example

Another example of using a spreadsheet to build a classical SPC Control chart with moving (or static) reference set (baseline) is shown on Figure 20.

Fig 20: Spreadsheet Control Chart Builder

The following formulas were used. Raw data are in columns A and B.

Center line (7-day moving average): =AVERAGE(B:B+7) => column F
Standard deviation: =STDEV(B:B+7) => column G
Upper Limit: =F+M$2*G => column H
Lower Limit: =F-M$2*G => column J

Other, non-parametric control limits can be used:

Lower Limit: =PERCENTILE(B3:B+10,0.05)
Upper Limit: =PERCENTILE(B3:B+10,0.95)
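For readers without the spreadsheet at hand, the same moving-baseline calculation can be sketched in Python. The alignment of the window (the seven days preceding each point) and the function name `moving_limits` are assumptions, since the exact cell ranges are only abbreviated above; the multiplier `k` plays the role that cell M$2 appears to play in the formulas.

```python
from statistics import mean, stdev

def moving_limits(values, window=7, k=3.0):
    """For each point after the first `window` points, derive limits from the preceding window.

    Returns rows of (index, actual, LCL, centerline, UCL).
    """
    rows = []
    for i in range(window, len(values)):
        reference = values[i - window:i]   # moving reference set
        center = mean(reference)           # counterpart of the AVERAGE column
        sigma = stdev(reference)           # counterpart of the STDEV column
        rows.append((i, values[i], center - k * sigma, center, center + k * sigma))
    return rows

# Hypothetical daily values; the last one is a spike.
for row in moving_limits([38, 40, 42, 38, 40, 42, 40, 60]):
    print(row)
```

Keeping the reference window fixed instead of sliding it would give the static-baseline variant of the same chart.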

To build a control chart automatically, a statistics-oriented programming system can be used. Here is an example of using the R system [6]. To build JPEG output (shown on Figure 10) against CSV input data, the R program shown on Figure 21 can be used.

Fig 21: R Script to Build a Control Chart

Fig 22: R Script and R GUI to Build IT- Control Chart [7]

Figure 22 shows the R GUI, a graphical output window, the fragment of input data and the script to build a typical IT-Control Chart. The same chart is presented on Figure 23 as an Excel pivot chart. That chart was built against a Mainframe DB2 average I/O time per threads metric, and shows a real-life case of capturing a severe malfunction of a DASD controller.

By the way, the “PivotChart” type of report is an excellent way to visualize and analyze a set of control charts. Figure 23 shows control charts for an ATM application, but they could easily be built for other available applications just by choosing one from a pull-down box in the upper left corner.

Note how the chart shows the first occurrence of the exception, which was not possible to see by using other types of charts because the level of I/Os was low, but unusual. Only a Control Chart clearly and proactively detects that.

Fig 23: Pivot Control Chart

8. WHY ARE THE CONTROL CHARTS SO POWERFUL?

Let’s summarize why it is a good idea to use control charts for IT application management.

- “Behavior learning”: when a dynamic baseline is used, the chart adjusts itself statistically to any significant event (upgrades, LPAR reconfigurations and so on), because the historical period follows the actual data and every event will eventually be older than the oldest day in the reference set (the baseline).

- “Correlation” allows you to see where the system performance and/or business driver metrics correlate, simply by analyzing synchronized control charts of different subsystems or metrics.

- “Do Not Mix Shifts”: A Control Chart by its nature visualizes the separation of work or peak time, and off times.

- “Statistical Model Choice” means playing with different statistical limits (e.g., one standard deviation vs. two or more standard deviations or using percentiles for non-normally distributed data samples) to tune the system and reduce the rate of false positives.

- “Summarization”: It uses summarized data (e.g. 6 or 8-month-long history of hourly data) and provides a smooth picture without too many spikes and small oscillations.

- “Outliers detection”: All workload pathologies (e.g. run-away or memory leak) are definitely statistically unusual; they are captured and then should be removed from historical data to keep control charts accurate.
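The “Statistical Model Choice” item in the list above can be illustrated by counting how many points are flagged under different limit widths. The function name `exception_count` and the numbers are hypothetical; the sketch only shows how wider limits reduce the number of reported exceptions.

```python
from statistics import mean, stdev

def exception_count(baseline, actual, k):
    """Count actual points outside mean -/+ k standard deviations of the baseline."""
    center, sigma = mean(baseline), stdev(baseline)
    lcl, ucl = center - k * sigma, center + k * sigma
    return sum(1 for v in actual if v < lcl or v > ucl)

baseline = [38, 40, 42]          # hypothetical history for one hourly group
actual = [41, 43, 45, 47, 39]    # hypothetical recent values
for k in (1, 2, 3):
    print(k, exception_count(baseline, actual, k))
```

Which width is appropriate depends on how many of the flagged points turn out to be real issues for the metric in question.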

Speaking about workload pathologies, Figures 24 and 25 show examples of the ramp type of run-away and memory leak situations, which are clearly seen and captured long before they become real issues (approaching 100% of resource usage).

Fig 24: Citrix Server Run-away Situation Example

One case on that figure shows a spectacular “saw” type of historical baseline of daily reboot (typical and simplest way to fight memory leaks), and actual data indicated some days where rebooting was forgotten. The other case shows that even a slow-going memory leak pattern, which is very hard to capture automatically, can be captured by using a control chart and exception detectors such as SEDS [5].

Fig 25: Examples of Control Chart Used to Report Memory Leaks

Not only pure server metrics can be presented using IT-Control Charts, but also other IT application metrics, for instance the end-to-end response time (Figure 26).

Fig 26: End-to-End Response Time Control Chart

When an unusual (and maybe still acceptable) response time is seen on the control chart, other subsystems of the application (e.g., database or network) should be presented on other control charts, which should help to proactively isolate the culprits of possible real issues so that timely preventive actions can be taken.

Fig 27: IT-Control Charts to Detect CPU Usage Anomalies

Not only response time but also other IT-related metrics can be controlled by IT-Control Charts, for example some business driver metrics, such as the number of web application hits or different types of business transactions or records (discussed in [3, 4, 10]). Analyzing them via IT-Control Charts in synch with the IT-Control Charts of server or application metrics, as shown on Figure 27, provides a powerful way to know how well IT supports the business.

REFERENCES

[1] Jeffrey Buzen and Annie Shum, “MASF -- Multivariate Adaptive Statistical Filtering”, CMG1995 Proceedings.

[2] Igor Trubin, “Global and Application Levels Exception Detection System, Based on MASF Technique”, CMG2002 Proceedings.

[3] Igor Trubin, “Exception Based Modeling and Forecasting”, CMG2008 Proceedings.

[4] Linwood Merritt and Igor Trubin, “Disk Subsystem Capacity Management, Based on Business Drivers, I/O Performance Metrics and MASF”, CMG2003 Proceedings.

[5] Igor Trubin, “Capturing Workload Pathology by Statistical Exception Detection System”, CMG2005 Proceedings.

[6] “The R Project for Statistical Computing”, http://www.r-project.org

[7] “Near-Real-Time IT-Control Chart R-Simulation”, technical blog posting, http://itrubin.blogspot.com/2010/06/near-real-time-it-control-chart-r.html

[8] Igor Trubin, “IT-Control Charts”, CMG2010 Proceedings.

[9] Igor Trubin, “SEDS-Lite: Using Open Source Tools (R, BIRT, MySQL) to Report and Analyze Performance Data”, CMG2012 Proceedings.

[10] Igor Trubin and Shadi Ghaith, “AIX frame and LPAR level Capacity Planning. User Case for Online Banking Application”, CMG2012 Proceedings.