data what type of data do you have v2.1

Data for Statistics - 1 Version 2.1 © 2010 Kasse Initiatives, LLC DATA

Upload: timkasse

Post on 21-Jan-2015


DESCRIPTION

Data for Statistics - A discussion about Data Types not found in the CMMI

TRANSCRIPT

  • 1. DATA

2. Data
Data Input for Analysis and Interpretation
Data are generally collected as a basis for action
You must always use some method of analysis to extract and interpret the information that lies in the data
The type of data that has been collected will determine the type of statistics or analysis that can be performed
Making sense of the data is a process in itself
Always provide a context for data
Data have no meaning apart from their context
Data should always be presented in such a way that preserves the evidence in the data for all the predictions that might be made from these data
3. Data - 2
Data should be completely and fully described
Who collected the data?
How were the data collected?
When were the data collected?
Where were the data collected?
What do these values represent?
If the data are computed values, how were the values computed from the raw inputs?
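One way to keep these answers attached to the data is to store them in a small record next to the values. The following is only a sketch: the `MeasurementRecord` class, its field names, and the sample values are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class MeasurementRecord:
    """Hypothetical container that keeps a data set together with its context."""
    values: List[float]
    collected_by: str              # Who collected the data?
    collection_method: str         # How were the data collected?
    collected_on: Optional[date]   # When were the data collected?
    collected_at: str              # Where were the data collected?
    meaning: str                   # What do these values represent?
    computation: Optional[str] = None  # How computed values were derived, if any

    def missing_context(self) -> List[str]:
        """Return the names of the context fields that were left empty."""
        missing = []
        for name in ("collected_by", "collection_method", "collected_at", "meaning"):
            if not getattr(self, name).strip():
                missing.append(name)
        if self.collected_on is None:
            missing.append("collected_on")
        return missing


# Made-up example: the "where" question was never answered.
record = MeasurementRecord(
    values=[41.5, 39.0, 44.25],
    collected_by="QA team",
    collection_method="time sheets",
    collected_on=date(2010, 3, 1),
    collected_at="",
    meaning="staff hours logged per task",
)
print(record.missing_context())
```

A check like `missing_context()` could run before any analysis so that undescribed data are caught early.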
4. Data - 3
Variation exists in all data and consists of both noise (random or common cause variation) and signal (nonrandom or special cause variation)
Without formal and standardized approaches for analyzing data, you may have difficulty interpreting and using your measurement results
When you interpret and act on measurement results, you are presuming that the measurements represent reality
5. Data - 4
To use data safely, you must have simple and effective methods not only for detecting signals that are surrounded by noise,
but also for recognizing and dealing with normal process variations when there are no signals present
Drawing conclusions and predictions from data depends not only on using appropriate analytical methods and tools,
but also on understanding the underlying nature of the data and the appropriateness of assumptions about the conditions and environments in which the data were obtained
6. Data Definitions
Categorical vs. Quantitative Variables - Variables can be classified as categorical (aka, qualitative) or quantitative (aka, numerical)
Categorical - Categorical variables take on values that are names or labels. The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be examples of categorical variables.
Quantitative - Quantitative variables are numerical. They represent a measurable quantity.
For example, when we speak of the population of a city, we are talking about the number of people in the city - a measurable attribute of the city. Therefore, population would be a quantitative variable
7. Data Definitions - 2
Discrete vs. Continuous Variables - Quantitative variables can be further classified as discrete or continuous.
If a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called a discrete variable.
Examples to clarify the difference between discrete and continuous variables.
Suppose the fire department mandates that all fire fighters must weigh between 150 and 250 pounds. The weight of a fire fighter would be an example of a continuous variable; since a fire fighter's weight could take on any value between 150 and 250 pounds.
Suppose we flip a coin and count the number of heads. The number of heads could be any integer value between 0 and plus infinity. However, it could not be any number between 0 and plus infinity. We could not, for example, get 2.5 heads. Therefore, the number of heads must be a discrete variable.
8. Attributes Data vs. Variables Data
9. Variables Data
Variables data is measured and plotted on a continuous scale
With variables data, an actual numeric estimate is derived for one or more characteristics of the population being sampled such as:
Time
Temperature
Length
Weight
Height
Volume
Voltage
Horsepower
Torque
Speed
Cost
10. Variables Data - 2
In software, examples of variables data include:
Effort expended - (Number of hours, days, weeks, years, etc., that have been expended by a workforce member on an identified topic)
Years of experience - (Total number of years of experience per category)
Memory utilization - (% of total memory available)
CPU utilization - (% of CPU used at any given moment in time)
Cost of rework - (Dollars and cents calculation of the rework based on the effort put forth by anyone involved in the finding and fixing of reported problems)
11. Counts Could Be Treated as Variables Data
There are many situations where counts get used as measures of size:
Total number of requirements
Total lines of code
Total bubbles in a data-flow diagram
Customer sites
Change requests received
Total people assigned to a project
When we count these things, we are counting all the entities in a population, not just the occurrence of entities with specific attributes
These should always be treated as variables data even though they are instances of discrete counts
12. Attributes Data
When working with attributes data, the focus is on learning about one or more specific non-numerical characteristics of the population being sampled
When attributes data are used for direct comparisons, they must be based on consistent areas of opportunity if the comparisons are to be meaningful
If the number of defects that are likely to be observed depends on the size (lines of code) of a module or component, all sizes must be nearly equal
If the probabilities associated with defect discovery depend on the time spent on inspecting or testing, the elapsed time spent must be nearly equal
13. Attributes Data - 2
In general, when the areas of opportunity for observing a specific event are not equal or nearly so, the chances of observing the event will differ across the observations
Then we must normalize (convert to rates) by dividing each count by its area of opportunity before valid comparisons are made
Conditions that make us willing to assume constant areas of opportunity seem to be less common in software environments
Normalization is almost always needed for software!
14. Attributes Data - 3
Example:
If the defects are being counted and the size of an item inspected influences the number of defects found, some measure of item size will also be needed to convert defect counts to relative rates that can be compared in meaningful ways (defects per lines of code)
If the variations in the amount of time spent inspecting or testing can influence the number of defects found, these times should be clearly defined and measured as well
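The normalization step can be sketched in a few lines. This is a minimal illustration that assumes size in lines of code is the area of opportunity; the function name `defects_per_kloc` and the sample numbers are made up for the example.

```python
from typing import List


def defects_per_kloc(counts: List[int], sizes_loc: List[int]) -> List[float]:
    """Divide each defect count by its area of opportunity (size in KLOC)."""
    if len(counts) != len(sizes_loc):
        raise ValueError("counts and sizes_loc must have the same length")
    rates = []
    for count, loc in zip(counts, sizes_loc):
        if loc <= 0:
            raise ValueError("module size must be positive")
        rates.append(count / (loc / 1000.0))
    return rates


# Made-up inspection results: the second module has more defects,
# but it is also five times larger.
counts = [12, 30]
sizes = [2000, 10000]
rates = defects_per_kloc(counts, sizes)
print(rates)  # [6.0, 3.0]
```

Comparing the raw counts would rank the second module as worse; comparing the rates reverses that ranking.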
15. Attributes Data - 4
One of the keys to making effective use of attributes data lies in preserving the ordering of each count in space and time
Sequence information (the order in time or space in which the data is collected) is almost always needed to correctly interpret counts of attributes
Make the counts specific: there must be an operational definition (a clear set of rules and procedures) for recognizing an attribute or entity if what gets counted is to be what the user of the data expects it to be
16. Attributes Data - 5
Attributes data is counted and plotted as discrete events:
Shipping errors
Percentage waste
Number of defects found
Number of defective items
Number of source statements of a given type
Number of lines of comments in a module of n lines
Number of people with certain skills on a project
Percentage of projects using formal inspections
Team size
Elapsed time between milestones
Staff hours logged per task
Backlog
Number of priority-one customer complaints
Percentage of non-conforming products in the output of an activity or a process
17. The Key to Classifying Data
The key to classifying data as attributes data or variables data depends not so much on whether the data are discrete or continuous, but on how they are collected and used
The total number of defects found is often used as a measure of the amount of rework or retesting to be performed
When used this way, it is viewed as a measure of size and treated as variables data, even though it is normally a count based on attributes
The method of analysis you choose for any data will depend on:
The questions you are asking
The data distribution model you have in mind
The assumptions you are willing to make with respect to the nature of the data (Page 79)
18. Data Type Classifications
Discrete
Continuous
19. Distributional Models Relationship to Chart Types
Each type of chart is related to a set of assumptions (a distributional model) that must hold for that type of chart to be valid.
There are six types of charts for attributes data
NP
P
C
U
XmR for counts
XmR for rates
20. Distributional Models Relationship to Chart Types - 2
XmR charts have an advantage over np, p, c, and u charts in that they require fewer and less stringent assumptions
They are easier to plot and use
They have wide applicability
They are recommended by many quality-control professionals
When assumptions of the distributional model are met, however, the more specialized np, p, c, and u charts can give better bounds for control limits and can offer advantages
21. Distributional Models Relationship to Chart Types - 3
NP Chart - An np chart is used when the count data are binomially distributed and all samples have equal areas of opportunity
These conditions occur in manufacturing settings when there is 100% inspection of lots of size n (n is constant) and the number of defective units in each lot is recorded
P Chart - A p chart is used when the data are binomially distributed but the areas of opportunity vary from sample to sample
A p chart could be appropriate if the lot size n were to change from lot to lot
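The usual three-sigma limits for these two charts follow from the binomial model. The sketch below assumes those textbook formulas; the function names and the sample figures are illustrative and not taken from the slides.

```python
import math
from typing import List, Tuple


def np_chart_limits(defectives: List[int], n: int) -> Tuple[float, float, float]:
    """Return (LCL, CL, UCL) for an np chart with constant lot size n."""
    p_bar = sum(defectives) / (n * len(defectives))
    center = n * p_bar
    spread = 3 * math.sqrt(n * p_bar * (1 - p_bar))
    return max(0.0, center - spread), center, center + spread


def p_chart_limits(
    defectives: List[int], lot_sizes: List[int]
) -> Tuple[float, List[Tuple[float, float]]]:
    """Return p_bar and one (LCL, UCL) pair per lot; limits widen as n shrinks."""
    p_bar = sum(defectives) / sum(lot_sizes)
    limits = []
    for n in lot_sizes:
        spread = 3 * math.sqrt(p_bar * (1 - p_bar) / n)
        limits.append((max(0.0, p_bar - spread), min(1.0, p_bar + spread)))
    return p_bar, limits


# Made-up data: four lots of 100 units with 10 defective units each.
print(np_chart_limits([10, 10, 10, 10], 100))

# Made-up data: lot size changes from 50 to 100, so each lot gets its own limits.
print(p_chart_limits([5, 10], [50, 100]))
```

The p chart recomputes the limits for every lot because the lot size n appears under the square root.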
22. Distributional Models Relationship to Chart Types - 4
C Chart - A c chart is used when the count data are samples from a Poisson distribution and the samples all have equal-sized areas of opportunity
U Chart - A u chart is used in place of a c chart when the count data are samples from a Poisson distribution and the areas of opportunity are not constant
Defects per thousand lines of code is an example for software
NP, P, C and U charts are the traditional control charts used with attributes data
XmR Chart - Useful when little is known about the underlying distribution or when the justification for assuming a binomial or Poisson process is questionable
Almost always a reasonable choice
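XmR limits need only the individual values and the moving ranges between successive values. The sketch below assumes the commonly published scaling constants 2.66 (for the individuals chart) and 3.268 (for the moving-range chart); the function names and the data are made up for illustration.

```python
from typing import Dict, List


def xmr_limits(values: List[float]) -> Dict[str, float]:
    """Compute center lines and control limits for an XmR chart."""
    if len(values) < 2:
        raise ValueError("at least two values are needed for a moving range")
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = sum(values) / len(values)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "x_center": x_bar,
        "x_ucl": x_bar + 2.66 * mr_bar,
        "x_lcl": x_bar - 2.66 * mr_bar,
        "mr_center": mr_bar,
        "mr_ucl": 3.268 * mr_bar,
    }


def signals(values: List[float], limits: Dict[str, float]) -> List[int]:
    """Return the indexes of individual values outside the X chart limits."""
    return [
        i for i, v in enumerate(values)
        if v > limits["x_ucl"] or v < limits["x_lcl"]
    ]


# Made-up counts in time order; the last value is far above the rest.
data = [10, 12, 10, 12, 10, 11, 10, 12, 11, 30]
limits = xmr_limits(data)
print(signals(data, limits))
```

Because the limits come from the data's own point-to-point variation, no binomial or Poisson assumption is needed.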
23. Distributional Models Relationship to Chart Types - 5
More About U Charts - U charts seem to have the greatest prospects for use in software settings
U charts require normalization (conversion to rates) when the areas of opportunity are not constant
Poisson might be appropriate when counting the number of defects in modules during inspection or testing
Defects per thousand lines of source code is an example of attributes data that is a candidate for u charts
Although u charts may be appropriate for studying software defect densities in an operational environment, we are not aware of any empirical studies that have generally validated the use of Poisson models for nonoperational environments such as inspections
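As a rough illustration, u chart limits can be computed from defect counts and module sizes as follows. The sketch assumes the standard Poisson-based three-sigma formula and uses size in KLOC as the area of opportunity; the function name and the numbers are hypothetical.

```python
import math
from typing import List, Tuple


def u_chart_limits(
    defects: List[int], areas: List[float]
) -> Tuple[float, List[Tuple[float, float, float]]]:
    """Return u_bar and one (rate, LCL, UCL) row per sample.

    areas holds each sample's area of opportunity, for example size in KLOC.
    """
    u_bar = sum(defects) / sum(areas)
    rows = []
    for count, area in zip(defects, areas):
        spread = 3 * math.sqrt(u_bar / area)
        rows.append((count / area, max(0.0, u_bar - spread), u_bar + spread))
    return u_bar, rows


# Made-up inspection data: defects found and module size in KLOC.
u_bar, rows = u_chart_limits([4, 9, 2], [1.0, 2.0, 1.0])
print(u_bar)
for rate, lcl, ucl in rows:
    print(f"rate={rate:.2f}  LCL={lcl:.2f}  UCL={ucl:.2f}")
```

When every area of opportunity equals 1, the same formula reduces to the c chart limits (the average count plus or minus three times its square root).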
24. Distributional Models Relationship to Chart Types - 6
Defects per module or defects per test are unlikely candidates for u charts, c charts, or any other charts for that matter
The ratios are not based on equal areas of opportunity and can't be normalized
There is no reason to expect them to be constant across all modules or tests when the process is in statistical control
25. Distributional Models Relationship to Chart Types - 7
If you are uncertain as to the model that applies, it can make sense to use more than one set of charts
If you think you may have a Poisson situation but are not sure that all conditions for a Poisson process are present, then plotting both a u chart and the corresponding XmR charts should bracket the situation
If both charts point to the same conclusions, you are unlikely to be led astray
If the conclusions differ, then you should investigate your assumptions or the events
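The bracketing idea can be sketched by running both analyses on the same data and comparing which points each one flags. The code below is an assumption-laden illustration: it uses the standard u chart formula, the common 2.66 constant for the XmR chart applied to the rates, and hypothetical function names and data.

```python
import math
from typing import Dict, List


def u_chart_signals(defects: List[int], areas: List[float]) -> List[int]:
    """Indexes of samples whose rate falls outside the u chart limits."""
    u_bar = sum(defects) / sum(areas)
    flagged = []
    for i, (count, area) in enumerate(zip(defects, areas)):
        spread = 3 * math.sqrt(u_bar / area)
        rate = count / area
        if rate > u_bar + spread or rate < max(0.0, u_bar - spread):
            flagged.append(i)
    return flagged


def xmr_signals(rates: List[float]) -> List[int]:
    """Indexes of rates outside the XmR individuals limits."""
    mr_bar = sum(abs(b - a) for a, b in zip(rates, rates[1:])) / (len(rates) - 1)
    center = sum(rates) / len(rates)
    return [i for i, r in enumerate(rates) if abs(r - center) > 2.66 * mr_bar]


def compare_charts(defects: List[int], areas: List[float]) -> Dict[str, object]:
    """Run both charts and report whether they flag the same points."""
    rates = [c / a for c, a in zip(defects, areas)]
    u_flags = u_chart_signals(defects, areas)
    x_flags = xmr_signals(rates)
    return {"agree": u_flags == x_flags, "u_chart": u_flags, "xmr": x_flags}


# Made-up data: defects per inspected module, each of size 1 KLOC.
result = compare_charts([5, 6, 4, 5, 6, 5, 4, 30], [1.0] * 8)
print(result)
```

If `agree` is false, that is the cue to revisit the Poisson assumption or to look more closely at the flagged events.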
26. Presenting Data
While it is simple and easy to compare one number with another, such comparisons are limited and weak
Limited because of the small amount of data used
Weak because both of the numbers are subject to variation
This makes it difficult to determine how much of the difference between the values is due to variation in the numbers and how much is due to real changes in the process
27. Presenting Data - 2
Graphs - There are two basic graphs that are the most helpful in providing the context for interpreting the current value
Time series graph (Run Chart)
Have months or years marked off on the horizontal axis and possible values marked off on the vertical axis
As you move from left to right, there is a passage of time
By visually comparing the current value with the plotted values for the preceding months you can quickly see if the current value is unusual or not
Histogram (Tally Plot)
An accumulation of the different values as they occur without trying to display the time order sequence
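A tally plot is simple enough to build as text. The sketch below is illustrative only; the function name `tally_plot` and the sample values are made up.

```python
from collections import Counter
from typing import List


def tally_plot(values: List[int]) -> str:
    """Build a text tally plot: one row per distinct value, one X per occurrence."""
    counts = Counter(values)
    lines = []
    for value in sorted(counts):
        lines.append(f"{value:>4} | {'X' * counts[value]}")
    return "\n".join(lines)


# Made-up daily staff hours, listed in the order they were recorded.
hours = [40, 42, 40, 44, 42, 40]
print(tally_plot(hours))
```

The plot shows how often each value occurred but discards the order of the observations, which is exactly the information a run chart keeps.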
28. Run Charts
[Figure: run chart titled "Number of Required Changes to a Module as the Project Approaches Systems Test"; stages along the horizontal axis: Syntax Check, Desk Check, Code Review, Unit Test, Integration and Test, Systems Test]
29. Histograms
[Figure: histogram; axis labels: Product Service Staff Hours (ticks from 32 to 56) and Number of Days (ticks from 0 to 20)]
30. Process Control Chart
[Figure: annotated process control chart with header fields PROCESS CONTROL CHART TYPE and METRIC. Source: Statistical Methods for Software Quality, Adrian Burr & Mal Owen, 1996]
Data relating to the process: numerical data taken in time sequence
Plotted points are either individual measurements or the means of small groups of measurements
Center Line (CL): mean of the data used to set up the chart
Upper Control Limit (UCL) and Lower Control Limit (LCL): represent the natural variation in the process
A point above or below the control lines suggests that the measurement has a special preventable or removable cause
The chart is analyzed using standard rules to define the control status of the process
The chart is used for continuous and time control of the process and prevention of causes
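Two detection rules of the kind applied to such a chart can be sketched as follows. The slides do not list the rules, so this is an assumption: the first test flags points outside the control limits, and the second flags a run of consecutive points on one side of the center line. The run length of 8 is one commonly cited choice (other rule sets use a different length), and the function names are hypothetical.

```python
from typing import List


def outside_limits(points: List[float], lcl: float, ucl: float) -> List[int]:
    """Indexes of points above the UCL or below the LCL."""
    return [i for i, p in enumerate(points) if p < lcl or p > ucl]


def long_runs(points: List[float], center: float, run_length: int = 8) -> List[int]:
    """Indexes at which a run of run_length or more points on one side is in progress.

    A point exactly on the center line breaks the run.
    """
    hits = []
    previous_side = 0
    streak = 0
    for i, p in enumerate(points):
        side = (p > center) - (p < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == previous_side:
            streak += 1
        else:
            streak = 1 if side != 0 else 0
        previous_side = side
        if streak >= run_length:
            hits.append(i)
    return hits


# Made-up measurements in time order.
print(outside_limits([5, 12, 20], lcl=6, ucl=18))
print(long_runs([11] * 9, center=10))
```

Either list being non-empty would be treated as a sign of a special cause worth investigating.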
31. Impacts of Poor Data Quality
Inability to conduct hypothesis testing and predictive modeling
Inability to manage the quality and performance of software or application development
Ineffective process change instead of process improvement
Ineffective and inefficient testing causing issues with time to market, field quality, and development costs
Products that are costly to use within real-life usage profiles
32. References
Brassard, Michael & Ritter, Diane, The Memory Jogger II: A Pocket Guide of Tools for Continuous Improvement & Effective Planning, GOAL/QPC, Salem, New Hampshire, 1994
Florac, W.A. & Carleton, A.D., Measuring the Software Process, Addison-Wesley, 1999
Six Sigma Academy, The Black Belt Memory Jogger: A Pocket Guide for Six Sigma Success, GOAL/QPC, Salem, New Hampshire, 2002
Wheeler, Donald J., Understanding Variation: The Key to Managing Chaos, SPC Press, Knoxville, Tennessee, 2000