Consider today’s presentation a first exposure to basic data mining techniques
(ASQ Illiana Section presentation notes, asq-illiana.org)


Consider today’s presentation a first exposure to basic data mining techniques. At the end of the session you will hopefully have a basic appreciation for how the methods work and why they could be attractive additions to the six sigma tool kit you may have.

You will not be able to perform an analysis yourself when the session is over, but references will be supplied to support starting the learning process.

And as always, keep in mind that “All models are wrong, but some are useful.”


We’ll zero in on two of the simplest methods to understand, communicate, and perform: classification and regression trees. And we’ll compare them to multiple linear regression, which is commonly used at this point in a six sigma project.

Note that if the response is a discrete variable, multiple linear regression cannot be used. Logistic regression is the proper tool in that case. BTW, just because you can assign numerical values to the levels of a discrete variable doesn’t mean you can use MLR. Values are still only categories.
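To make the point concrete, here is a minimal sketch (not from the presentation, which used JMP/XLMiner) of fitting a binary response with logistic regression in scikit-learn. The data are synthetic and invented purely for illustration.

```python
# Sketch: a discrete (0/1) response calls for logistic regression, not MLR.
# Synthetic data; the "defect" probability rises with x.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
p = 1 / (1 + np.exp(-(x[:, 0] - 5)))          # true probability of a "1"
y = (rng.uniform(size=200) < p).astype(int)   # discrete 0/1 response

model = LogisticRegression().fit(x, y)
# predict_proba gives probabilities bounded in [0, 1] -- unlike MLR,
# which can happily predict values outside that range.
probs = model.predict_proba(x)[:, 1]
print(probs.min() >= 0 and probs.max() <= 1)  # True
```

Note that assigning numeric codes to the categories and running MLR would not fix anything: the codes are still just labels.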


A recent article in The Economist cites studies suggesting that advanced algorithms derived with mining techniques will make additional occupations obsolete, ones that even ten years ago would have seemed “safe.”

• Robots in manufacturing are “learning” from experience.

• Computers using algorithms are better than people at detecting patterns in financial & security images.

• Algorithmic pattern recognition in medical images can detect smaller abnormalities than people can.

According to a source quoted in the article, many occupations have p > 0.50 of diminishing job prospects over the next two decades. These include airline pilots, machinists, word processors & typists, real estate agents, technical writers, retail sales people, accountants & auditors, and telemarketers.

Moneyball & Nate Silver’s accurate prediction of the last two presidential elections made use of predictive analytics. It seems obvious that quality professionals, including those involved in six sigma, will be touched by “big data” at some point in their careers. Yet, even if the topic has no direct impact on your job, as a consumer and as a citizen, it’s important that you understand the basics.


One can’t assume that the data contained in a file are reliable, by which we mean stable and predictable over time. Trials are run, breakdowns occur. Each of these can either add to the range of values out there (possibly a good thing) or add noise, the details of which are likely lost as time since the event lengthens.

How much useful information a project obtains from a database is at least partly determined by the quality of effort put into verifying the quality of the data available.


Always verify that the data you will be using contains only valid results.

Make sure what you thought you were asking Access or SQL to get is what you really got. If not, fix the query and keep trying until it’s right.

In databases with many columns, it’s not uncommon to find missing values or ones you decide to delete as bogus. Software commonly just deletes the entire record from the analysis. This can quickly make your big dataset not nearly as big.

Fix values that are clearly just blunders, like misplacing a decimal point. Sometimes empty cells have place holders, like ‘9999.’ These can be set to missing, or values imputed. Imputation replaces empty or “defective” values with another one. Sometimes the mean or median is substituted. Other, more sophisticated, methods also exist, but are beyond what will be covered today.


There are comparable assumptions in the logistic regression case as well.


In noisy industrial systems it’s quite common to encounter low R^2 values, even when one knows all critical variables have been included in the model. Max R^2 = 1 − (fraction of variance due to measurement error). If one knows the variance components from the measurement system, one can calculate the maximum R^2 possible and normalize the observed value to an “error free” basis.
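The adjustment is simple arithmetic. A sketch with assumed (invented) numbers:

```python
# Sketch of the note's adjustment: if measurement error accounts for a known
# fraction of total variance, the best attainable R^2 is 1 minus that
# fraction, and an observed R^2 can be re-expressed on an "error free" basis.
pct_meas_error = 0.30   # fraction of total variance from measurement (assumed)
r2_observed = 0.55      # R^2 reported by the regression (assumed)

r2_max = 1 - pct_meas_error            # best possible R^2 here: 0.70
r2_normalized = r2_observed / r2_max   # observed R^2 on an error-free basis
print(r2_max, round(r2_normalized, 3))
```

With these assumed numbers, a regression that "only" explains 55% of the variance is in fact recovering nearly 79% of the variance it could possibly explain.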


Correlation between predictors is called multi-collinearity. In MLR it is detected by asking the software to supply variance inflation factors (VIF) for the predictors. The rule of thumb is:

0 < VIF < ~5: okay

~5 < VIF < 10: marginal

VIF > 10: multi-collinearity is a problem. Find somebody who can do Principal Components Analysis.
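The VIF itself comes straight from its definition: regress each predictor on the others and take VIF_j = 1 / (1 − R²_j). A sketch on synthetic data (scikit-learn here, though any regression routine works):

```python
# Sketch: variance inflation factors from their definition,
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on the remaining predictors. Synthetic data with built-in collinearity.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + 0.1 * rng.normal(size=300)   # nearly a copy of x1 -> collinear
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
# x1 and x3 should show VIF well above 10; x2 should sit near 1.
print([round(v, 1) for v in vifs])
```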


Data transformation can be a black art. It requires a working knowledge of mathematical functions that many of us forgot shortly after leaving Algebra II in high school. Not an insurmountable problem, but something to keep in mind when patterns are detected in the residuals.


Standardized means that each residual has been divided by the standard deviation of the residuals. The values are thus the number of standard deviations each lies away from the average.


The values on the four panels are a cautionary tale of what happens when one does not do a residuals analysis or, better yet, study the relationships between variables before starting a regression analysis.

The quartet was developed by Yale statistician Francis Anscombe and published in The American Statistician (27 (1): 17–21) in 1973. It is often called “Anscombe’s Quartet.”
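The quartet's punch line is easy to verify numerically: the four datasets (values as published in the 1973 paper) share nearly identical means, regression lines, and correlations, despite looking completely different when plotted.

```python
# Anscombe's quartet: four datasets with nearly identical summary
# statistics -- hence the need to plot and to examine residuals.
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
xs = [x123, x123, x123, x4]

for x, y in zip(xs, ys):
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    # Each panel: mean y ~ 7.50, fitted slope ~ 0.500, r ~ 0.816
    print(round(np.mean(y), 2), round(slope, 3), round(r, 3))
```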


Trees themselves are maps of the inter-relationships of variables in what may be very complex models.

Just because the assumptions we discussed for MLR are no longer a problem does not mean the methods are foolproof.


The process is the same for both methods. For classification trees, the aim is to put all the “pink dots” in one box and the “green dots” in another. For regression trees, we want to minimize the variation within each node.

Both methods split parent nodes into two child nodes. Other methods allow more splits on a node.

As we’ll see, significant effort is expended prior to submitting the dataset to the software. The effort is ideally the same as that expended before starting any statistical analysis. The aim is to assure that only valid data are included in the analysis.

Over-fitting occurs in MLR when the analyst focuses only on maximizing the R^2 value. In CART, one can fit models so that each leaf node (the last split on a branch) has only one “color.” Such a tree might have one leaf node for each observation, and it would obviously have no predictive value, so we prune back far enough to eliminate the silliness.

Unlike MLR, in CART methods, the original dataset is split into training, validation, and test sets. We build a tree with the first, prune it with the second, and see how well it works with the last. No reason why one wouldn’t want to consider the same process for MLR.
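The grow-then-prune workflow on a train/validation/test split can be sketched with scikit-learn (not the presenter's software; data are synthetic, and the 60/20/20 proportions are illustrative):

```python
# Sketch of the train / validate / test workflow for a classification tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# 60% train, 20% validation (for pruning), 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Grow a deliberately over-fit tree, then "prune" by choosing the
# cost-complexity alpha that does best on the validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
    .fit(X_train, y_train).score(X_val, y_val),
)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best).fit(X_train, y_train)
print(round(pruned.score(X_test, y_test), 3))  # honest estimate from held-out data
```

The test-set score is reported last, from data the tree never saw during growing or pruning.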


The dataset used in the first example is now a classic and was developed by R. A. Fisher and published in the Annals of Eugenics (1936). Fifty plants from each of three species of irises--setosa, virginica, and versicolor--were included in the set. The object was to determine whether the lengths & widths of their petals could discriminate between the three species.

We will use a classification tree in place of the much more complex method Fisher devised for the task. At the end, we will have defined the parameter settings that split the three species most successfully and will assess how good a job our model does at discriminating between the three.


Black, red, and green dots tend to occupy space with little overlap, but note that some does occur between red and green.

One could eyeball where splits might go (it’s pretty apparent for black and red) and write down rules to convey the info to others. However, most cases aren’t this obvious—nor is the placement of the horizontal line.


This is the classification tree for the iris data. At the top, the root node, notice that each species is one-third of the observations. The software looks at all possible ways of splitting the data for both potential predictors and determines that the purity of the split is maximized if a value of 2.45 cm for petal length is selected. The left daughter node contains only (and all of) the setosa observations. Purity of the right node has increased to 50% for each of the other two species, from 33%. The software repeats the recursive partitioning process a second time and determines that splitting the right node at a petal width of 1.75 cm gives the cleanest split of the other two species.

Unlike the other case, each of the daughter nodes contains observations from the minority species. Seven items are in the wrong nodes, or 4.6% of the original 150 items. The misclassification rate that this represents is a commonly used metric to assess model performance, sort of like an R2, but not exactly. While 4.6% sounds good for this case, the boss wouldn’t be keen on your model for predicting mortgage defaults or whether an individual was likely to be a terrorist.

The rules for determining into which species to place a future iris plant are in the yellow box. These are easily communicated to others.
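The iris dataset ships with scikit-learn, so the tree above can be reproduced in a few lines (the presentation used JMP; results from a different implementation can differ slightly in the exact misclassification count):

```python
# Sketch: the iris classification tree, limited to two levels of splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:4]  # petal length and petal width, in cm
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, iris.target)

# The root split separates all 50 setosa perfectly (petal length ~2.45 cm,
# though a petal-width split at ~0.80 cm does the same job); the next
# split near petal width 1.75 cm separates versicolor from virginica,
# leaving only a handful of misclassified plants.
print("root threshold:", tree.tree_.threshold[0])
print("resubstitution accuracy:", tree.score(X, iris.target))
```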


Both this dataset and the one for the iris example are available from the University of California Irvine. http://archive.ics.uci.edu/ml/.

The regression that appeared in the original 1978 paper contained 14 terms, including the intercept, several of which were log terms and others of which were square or quadratic terms. It almost certainly took several days and many iterations to arrive at their model, not to mention the skill to perform the analysis. (Of course it was worse, since this was before the days of desktop PC’s. Punching the cards alone probably took a grad student many hours!)


The same process is followed here as was for the classification tree. However, since there were additional variables, the tree was grown larger than the one shown here. The original data had been partitioned into three sets, the largest of which was used to train the model. The second, smaller set was used to measure the percent error for trees with increasing numbers of nodes. As the tree is first grown, the percent error decreases with each additional split, but at some point it begins to increase as “noisy” splits are added that don’t bring much to the party. We prune the tree back to its size at the inflection point.
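The error-versus-size behavior just described can be sketched for a regression tree on synthetic data (scikit-learn again, standing in for the presenter's software; here tree size is controlled via `max_leaf_nodes` and "error" is tracked as validation R²):

```python
# Sketch: validation performance vs tree size for a regression tree --
# too few leaves under-fit, too many chase noise.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(600, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=600)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=0)

# Grow trees of increasing size and track validation R^2.
sizes = [2, 4, 8, 16, 32, 64, 128, 256]
scores = [
    DecisionTreeRegressor(max_leaf_nodes=n, random_state=0)
    .fit(X_train, y_train).score(X_val, y_val)
    for n in sizes
]
best_size = sizes[int(np.argmax(scores))]
# Typically an intermediate size wins; "pruning" here amounts to
# keeping the tree at that size rather than the largest one grown.
print(best_size, round(max(scores), 3))
```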


An automated system captured data on nearly 100 variables for each defect that cameras spotted as the strip moved under them. It was felt that making use of these data would make it possible to switch from assessing performance with rejection data to measuring it with defect densities from the camera system.

We knew from experience that the defect in question shared similar values on some variables with other defect types. This led to a high false-alarm rate, which rendered the camera system fairly useless for the intended purpose.


The steps listed are always done in CART. The first three are critical, so let’s expand on the whats and whys of them.


The first four of these are familiar to most people, but correlation matrices, not so much. The potential predictors are placed as column and row headings of a table, and r-values for each pairwise combination of variables populate the cells. Often, p-values for the significance of the correlations are available as an option. Significant correlations (suggest p < 0.1 or 0.15) flag variables which, if placed in the model together, could lead to multi-collinearity.
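A correlation matrix with accompanying p-values is easy to build with pandas and scipy; a sketch on synthetic data with one built-in correlation (the variable names are invented):

```python
# Sketch: pairwise r-values plus p-values for a set of candidate predictors.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "temp": rng.normal(size=100),
    "speed": rng.normal(size=100),
})
df["pressure"] = 0.8 * df["temp"] + 0.2 * rng.normal(size=100)  # correlated

r_matrix = df.corr()  # pairwise r-values
cols = df.columns
p_matrix = pd.DataFrame(
    [[pearsonr(df[a], df[b])[1] for b in cols] for a in cols],
    index=cols, columns=cols,
)
# temp vs pressure shows a large r and a tiny p-value; a significant p
# (say < 0.10) flags a multi-collinearity risk if both enter the model.
print(r_matrix.round(2))
```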


It’s common to find that the response to be modeled is quite rare (think credit card fraud in the quaint old days or tumors on mammograms). Need enough “1’s” with the “0’s” to build a reliable model. It’s okay to over-sample items with a rare response; in fact in many cases it is essential.

The coils selected had a total of over 43,000 allegedly pink defects. Around 4,100 of them were verified pink. By over-sampling “1’s” and having a large number of coils, it was felt that there were enough true defects to continue. Had the purpose been to predict something with critical implications, this might have been the wrong conclusion. Context is everything.
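Over-sampling a rare class can be sketched with scikit-learn's `resample` utility; the counts and data below are invented for illustration, not taken from the coil study:

```python
# Sketch: up-sample a rare "1" class (with replacement) before modeling.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
y = (rng.uniform(size=1000) < 0.05).astype(int)  # ~5% "true defects"

X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# Sample the minority class with replacement up to the majority count.
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=len(y_maj), random_state=0)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([y_maj, y_up])
# Classes are now balanced; the model sees enough "1's" to learn from.
```

Any performance estimates should still come from data with the original class mix, or the error rates will be misleading.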


Always make use of anything you know about a process when performing any analysis. While it may feel scientific to enter any experiment with no assumptions about reality, it can actually be just the opposite. One seldom starts with a “blank piece of paper.” Not validating that what you know (or think you know) is true comes with risks, as will be seen below.


In the left panel, the coil length is divided into 20 categories. Since defects are known to occur uniformly along the length, one would have expected to find roughly 5% of the defects identified by the camera system in each bar. There is a clear bias toward finding them in the 5% nearest each end of the coils. We knew from observation of many coils on the line that this could happen, but were somewhat surprised by the extent.

(This slide was turned over to local management and drove a separate project that virtually eliminated the second defect, increasing product yield by a few percentage points.)

The right panel should have looked like a bell-shaped distribution across the width, with a sizeable majority within ±30% of the center of the coils. That clearly is not the case here; the outer 10% on both edges has roughly 2.5x more defects than the next nearest section.

Both panels called for further investigation. Since images for each defect had been checked, we knew with great certainty that true defects were very rare in the areas of interest. Ultimately, the software would exclude 5% from each end and 3% from either edge. While some true defects were deleted, the vast majority were false positives, and the quality of the classifying model was acceptable. Having done the preliminary screening, there was little question that the software did what the black belt would have done, though the split criteria were more rigorous with use of the software.


On the surface, it would seem unnecessary, even arbitrary, to exclude the variables that “didn’t apply.” After all, the only cost would be a small increase in the time to run the software.

The risk of leaving the variables in comes from the prospect that recorded values would serendipitously line up with levels of the dependent variable. To the extent that that was allowed to occur, we would have deliberately included noise in our model, noise that could (and likely would) have a different structure in the future.


Three step process to building the model with, in this case, JMP 5.1:

• Built the model with the 60% of the data reserved for building; overgrew the tree and pruned it back.

• Validated the model using the 30% selected for validation. The tree was very similar.

• Ran the remaining 10% through the model’s rules to determine how well it predicted the outcomes for this split.

ABS(SEN) is distance from the center on either side of the strip.

NXSD is a measure of how far neighboring pixels in the defect are from the center of the defect. Small is closer.

Coil end = ‘yes’ where the position was 5% or less from the end of the coil.

Notice that the first split at ABS(SEN)=34% means that 68% of the actual defects (~4000) were captured in the left node. Only 153 were captured in the right node. Splitting that node at ABS(SEN)<=43% isolated 138 of these.

Again on the first-split left node, smaller values of NXSD were more characteristic of the defect than larger ones, and splitting that left node based on whether the defect was at the end of the coil or not further concentrated the purity of the nodes. A final split on ABS(SEN) for coil end = ‘no’ completed the final tree.

The initial tree was grown to more levels, with additional splits on what now are leaf nodes, and pruned back to the final version.


The modeling exercise produced the tree on the last slide. A set of rules based upon the tree were developed by the software (or by inspection). This was not the only tree possible, and hence not the only set of rules that could have been used to classify observations in the split of data held back for testing how well the model would work.

One of the positives about CART methods is that the model is easily communicated to management personnel.

It remains to determine how well the model predicts outcomes. The model’s rules were used to create a column of 0’s and 1’s in the dataset that were the predicted outcomes for the hold-back data. Data from that column were compared to values from the column that contained the actual values for the observed defects: true defect or false. There are four possible outcomes:

• Predicted true and it was a true defect;

• Predicted false and it was not a real defect;

• Predicted true, but it was not a real defect;

• Predicted false, but it was a real defect.
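These four outcomes are exactly the cells of a confusion matrix; a sketch with scikit-learn using made-up actual and predicted labels:

```python
# Sketch: tabulating the four prediction outcomes in a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])  # invented labels
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# For binary labels, rows are actual and columns are predicted:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(actual, predicted)
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()            # fraction predicted correctly
misclassification = (fp + fn) / cm.sum()   # the rate quoted on the next slide
print(cm)
print(accuracy, misclassification)
```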


With MLR, we typically stop looking when the R2 is good and no problems surfaced in the analysis of the residuals. In CART, we quantify how well our model matches the actual results in a confusion matrix like the one in this slide. The percentages in the NW & SE quadrants represent our successes; those in the NE & SW quadrants are our prediction mistakes. Our model correctly predicted 84.5% of the cases in the test set and misclassified the other 15.5%.

Is this model good? Depends on how critical making a wrong decision is. In our case, the cost of checking each “pink dot” to generate a control chart was real--and high, whereas risks associated with miscalculating the defect density were relatively low. The model was, therefore, implemented—but with some strings attached.

There are other ways to gauge performance, but this is probably the simplest one to understand—as well as being the easiest to compute and explain.


After looking at each “pink dot” on nearly 2000 coils over a year-and-a-half, the black belt and his partner had never seen a single coil with more than around 100 slivers, so seeing a significant number of outlier points was contrary to their experience. A number of maps for coils that the model said were rotten with defects were reviewed, and the reality was that frequencies were not elevated. Defects of types with characteristics similar to the one of interest were found in some of the cases, mostly in numbers too small to explain the magnitude of the predicted value.

Options were to start over including additional characteristics from the inspection system, or to assume that the outliers were functions of system noise. The second option was chosen, given the need to free resources for more critical activities. Periodically over the next few years, however, maps for coils with aberrant counts were reviewed. In every case, the high numbers predicted did not approach the reality of true defects.

Another possibility, one that highlights a weakness of CART analyses, is that, beyond some point, the data did not lend themselves to the splitting algorithm. Splits in CART are vertical or horizontal. Recall the iris example from earlier. Imagine all the setosa observations in a triangle in the upper left corner, the virginica in a triangle in the lower right corner, and the versicolor distributed over the stripe in the middle. The algorithms will split the variables into many small ranges—that still have low purity between categories. These will be pruned back from the initial tree during validation.


It is readily apparent that 99% of all predictions by the model shown here were less than about 60-70/km. The experience of the black belt and his partner suggested about that many as a maximum, barring a catastrophic special cause. This graph shows too many “catastrophes” by a wide margin.


We don’t always have monstrous datasets, or even want them. Techniques have been in use over the past 10-15 years in which the dataset is sampled with replacement many times. Many trees are grown (a forest), and a sort of consensus is arrived at by “voting,” or by other more sophisticated methods. In the process, many of the problems that classification & regression trees have are minimized. You won’t likely need forests for “beginner” projects.
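The forest idea is one line in scikit-learn; a sketch on synthetic data comparing a single fully grown tree to the vote of many bootstrap-trained trees:

```python
# Sketch: many trees on bootstrap resamples, predictions combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# The vote of many trees usually beats one fully grown (over-fit) tree.
print(round(single.score(X_test, y_test), 3),
      round(forest.score(X_test, y_test), 3))
```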

We introduced methods that split nodes into two descendants. There are many cases where more than two splits make sense. CHAID (Chi-square Automatic Interaction Detection) analysis handles this situation. This is one of the more venerable data mining methods and similar in many respects to CART.

Other methods exist for text mining (including verbal communications), image analysis (including facial recognition), and even Bayesian networks, which under the right conditions can establish root causal relationships in complex systems. All are well beyond today’s basic introduction, but they might find applications where you work, depending upon the nature of the business.


If you have access to JMP, you already have the ability to run CART analyses. You’ll find it in the Partition Platform. I have version 5.1 (clunky-JMP), so versions at least that old have the capability.

XLMiner is an add-in for Excel from Frontline Systems. Both the current version of JMP and XLMiner are priced at around $1500 for a license. You can google both for additional information. I’ve used both with no significant issues, and both have good documentation. JMP obviously is one of the major statistical packages and does much more than merely mine data. XLMiner also does regressions (linear as well as logistic), principal components, and some other things, but is not as all-encompassing as JMP. Still, it greatly expands the repertoire of Excel.

For those on a more limited budget, as in “none,” there is a free option, R, that comes with a price: you will need to learn a programming language. One of the references in the companion paper helps one learn R basics, enough so that one can work the case studies that make up the book. It’s very well done and comes recommended if R is in your future.

Statistics.com offers on-line classes in a variety of statistical applications, including several in “Predictive Analytics.” Classes are at one’s own pace over, typically, four weeks, and consist of readings from materials that the professor assigns and homework assignments. There is a board for each class where questions can be posted, and they typically are answered within a day. Costs per class are around $500 plus whatever textbooks cost. A trial of XLMiner comes with the Predictive Analytics class.


Should you not be able to access these materials, feel free to email me at [email protected], and I’d be happy to send them to you.


Thanks for your attention today. While this has merely been a 30,000-foot look at the two techniques, hopefully it has made you aware of another tool that can be applied in the process of narrowing down the options of what needs to be studied in a six sigma project.