Predictive Modeling with Clementine®

32492-001

V11.1 4/07 mr/jm


For more information about SPSS® software products, please visit our Web site at http://www.spss.com or contact SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412. Tel: (312) 651-3000 Fax: (312) 651-3668

SPSS is a registered trademark and its other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials.

The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.

Graphs powered by SPSS Inc.'s nViZn(TM) advanced visualization technology http://www.spss.com/sm/nvizn Patent No. 7,023,453

General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective companies.

Project phases are based on the CRISP-DM process model. Copyright © 1997–2003 by CRISP-DM Consortium (http://www.crisp-dm.org). Microsoft and Windows are registered trademarks of Microsoft Corporation. IBM, DB2, and Intelligent Miner are trademarks of IBM Corporation in the U.S.A. and/or other countries. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. DataDirect and SequeLink are registered trademarks of DataDirect Technologies. Copyright © 2001–2005 by JGoodies. Founder: Karsten Lentzsch. All rights reserved.

Predictive Modeling with Clementine
Copyright © 2007 by SPSS Inc. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.


CHAPTER 1: PREPARING DATA FOR MODELING
1.1 INTRODUCTION
1.2 CLEANING DATA
1.3 BALANCING DATA
1.4 NUMERIC DATA TRANSFORMATIONS
1.5 BINNING DATA VALUES
1.6 DATA PARTITIONING
1.7 ANOMALY DETECTION
1.8 FEATURE SELECTION FOR MODELS
SUMMARY EXERCISES

CHAPTER 2: NEURAL NETWORKS
2.1 INTRODUCTION TO NEURAL NETWORKS
2.2 THE NEURAL NETWORK NODE
2.3 MODELS PALETTE
2.4 VALIDATING THE LIST OF PREDICTORS
2.5 UNDERSTANDING THE NEURAL NETWORK
2.6 UNDERSTANDING THE REASONING BEHIND THE PREDICTIONS
2.7 SAVING THE STREAM
2.8 MODEL SUMMARY
2.9 ADVANCED NEURAL NETWORK TECHNIQUES
2.10 TRAINING METHODS
2.11 THE MULTI-LAYER PERCEPTRON
2.12 THE RADIAL BASIS FUNCTION
2.13 EXPERT OPTIONS
2.14 AVAILABLE ALGORITHMS
2.15 WHICH METHOD, WHEN?
2.16 SENSITIVITY ANALYSIS
2.17 PREVENTION OF OVER-TRAINING
2.18 MISSING VALUES IN NEURAL NETWORKS
2.19 EXPLORING THE DIFFERENT NEURAL NETWORK OPTIONS
SUMMARY EXERCISES

CHAPTER 3: DECISION TREES/RULE INDUCTION
3.1 INTRODUCTION
3.2 COMPARISON OF DECISION TREE MODELS
3.3 USING THE C5.0 MODEL
3.4 BROWSING THE MODEL
3.5 GENERATING AND BROWSING A RULE SET
3.6 UNDERSTANDING THE RULE AND DETERMINING ACCURACY
3.7 UNDERSTANDING THE MOST IMPORTANT FACTORS IN PREDICTION
3.8 FURTHER TOPICS ON C5.0 MODELING
3.9 MODELING SYMBOLIC OUTPUTS WITH OTHER DECISION TREE ALGORITHMS
3.10 MODELING SYMBOLIC OUTPUTS WITH CHAID
3.11 MODELING SYMBOLIC OUTPUTS WITH C&R TREE
3.12 MODELING SYMBOLIC OUTPUTS WITH QUEST
3.13 INTERACTIVE TREES
3.14 PREDICTING NUMERIC FIELDS
SUMMARY EXERCISES

CHAPTER 4: LINEAR REGRESSION
4.1 INTRODUCTION
4.2 BASIC CONCEPTS OF REGRESSION
4.3 AN EXAMPLE: ERROR OR FRAUD DETECTION IN CLAIMS
SUMMARY EXERCISES

CHAPTER 5: LOGISTIC REGRESSION
5.1 INTRODUCTION TO LOGISTIC REGRESSION
5.2 A MULTINOMIAL LOGISTIC ANALYSIS: PREDICTING CREDIT RISK
5.3 INTERPRETING COEFFICIENTS
SUMMARY EXERCISES

CHAPTER 6: DISCRIMINANT ANALYSIS
6.1 INTRODUCTION
6.2 HOW DOES DISCRIMINANT ANALYSIS WORK?
6.3 THE DISCRIMINANT MODEL
6.4 HOW CASES ARE CLASSIFIED
6.5 ASSUMPTIONS OF DISCRIMINANT ANALYSIS
6.6 ANALYSIS TIPS
6.7 COMPARISON OF DISCRIMINANT AND LOGISTIC REGRESSION
6.8 AN EXAMPLE: DISCRIMINANT
SUMMARY EXERCISES

CHAPTER 7: DATA REDUCTION: PRINCIPAL COMPONENTS
7.1 INTRODUCTION
7.2 USE OF PRINCIPAL COMPONENTS FOR PREDICTION MODELING AND CLUSTER ANALYSES
7.3 WHAT TO LOOK FOR WHEN RUNNING PRINCIPAL COMPONENTS OR FACTOR ANALYSIS
7.4 PRINCIPLES
7.5 FACTOR ANALYSIS VERSUS PRINCIPAL COMPONENTS ANALYSIS
7.6 NUMBER OF COMPONENTS
7.7 ROTATIONS
7.8 COMPONENT SCORES
7.9 SAMPLE SIZE
7.10 METHODS
7.11 OVERALL RECOMMENDATIONS
7.12 EXAMPLE: REGRESSION WITH PRINCIPAL COMPONENTS
SUMMARY EXERCISES

CHAPTER 8: TIME SERIES ANALYSIS
8.1 INTRODUCTION
8.2 WHAT IS A TIME SERIES?
8.3 A TIME SERIES DATA FILE
8.4 TREND, SEASONAL AND CYCLIC COMPONENTS
8.5 WHAT IS A TIME SERIES MODEL?
8.6 INTERVENTIONS
8.7 EXPONENTIAL SMOOTHING
8.8 ARIMA
8.9 DATA REQUIREMENTS
8.10 AUTOMATIC FORECASTING IN A PRODUCTION SETTING
8.11 FORECASTING BROADBAND USAGE IN SEVERAL MARKETS
8.12 APPLYING MODELS TO SEVERAL SERIES
SUMMARY EXERCISES

CHAPTER 9: DECISION LIST
9.1 INTRODUCTION
9.2 A DECISION LIST MODEL
9.3 COMPARISON OF RULE INDUCTION MODELS
9.4 RULE INDUCTION USING DECISION LIST
9.5 UNDERSTANDING THE RULES AND DETERMINING ACCURACY
9.6 UNDERSTANDING THE MOST IMPORTANT FACTORS IN PREDICTION
9.7 EXPERT OPTIONS FOR DECISION LIST
9.8 INTERACTIVE DECISION LIST
SUMMARY EXERCISES

CHAPTER 10: FINDING THE BEST MODEL FOR BINARY OUTCOMES
10.1 INTRODUCTION
SUMMARY EXERCISES

CHAPTER 11: GETTING THE MOST FROM MODELS
11.1 INTRODUCTION
11.2 MODIFYING CONFIDENCE VALUES FOR SCORING
11.3 META-LEVEL MODELING
11.4 ERROR MODELING
SUMMARY EXERCISES


Chapter 1: Preparing Data for Modeling

Overview

• Preparing and cleaning data for modeling
• Balancing data using the Distribution and Balance nodes
• Transforming skewed numeric data with the Derive node
• Grouping data with the Binning node
• Partitioning the data into training and testing samples with the Partition node
• Detecting unusual cases with the Anomaly node
• Selecting predictors with the Feature Selection node

Objectives

In this chapter, after a brief discussion regarding the cleaning of data, we will introduce several techniques that may be useful in preparing data for modeling.

Data

In this chapter we use a data set from a leading telecommunications company, churn.txt. The file contains records for 1477 of the company's customers who have at one time purchased a mobile phone. It includes such information as length of time spent on local, long distance and international calls, the type of billing scheme and a variety of basic demographics, such as age and gender. The customers fall into one of three groups: current customers, involuntary leavers and voluntary leavers. We want to use data mining to understand what factors influence whether an individual remains as a customer or leaves for an alternative company. The data are typical of what is often referred to as a churn example (hence the file name).

Note about Type Nodes in this Course

Streams presented in this course contain Type nodes, although in most instances the Types tab in the Source node would serve the same purpose.

Clementine and Clementine Server

By default, Clementine will run in local mode on your desktop machine. If Clementine Server has been installed, then Clementine can be run in local mode or in distributed (client-server) mode. In this latter mode, Clementine streams are built on the client machine, but executed by Clementine Server. Since the data files used in this training course are relatively small, we recommend you run in local mode. However, if you choose to run in distributed mode make sure the training data are either placed on the machine running Clementine Server or that the drive containing the data can be mapped from the server.

To determine in which mode Clementine is running on your machine, click Tools…Server Login (from within Clementine) and see whether the Connection option is set to Local or Network. This dialog is shown below.


Figure 1.1 Server Login Dialog in Clementine

Note Concerning Data for this Course

Data for this course are assumed to be stored in the folder c:\Train\ClemPredModel. At SPSS training centers, the data will be located in a folder of that name. Note that if you are running Clementine in distributed (Server) mode (see note above), then the data should be copied to the server machine or the directory containing the data should be mapped from the server machine.

1.1 Introduction

Preparing data for modeling can be a lengthy but essential and extremely worthwhile task. If data are not cleaned and modified/transformed as necessary, it is doubtful that the models you build will be successful. In this chapter we will introduce a number of techniques that enable such data preparation.

We will begin with a brief discussion concerning the handling of blanks and cleaning of data, although this is covered in greater detail in the Introduction to Clementine and Data Mining and Preparing Data for Data Mining courses. Following this, we will introduce the concept of data balancing and how it is achieved within Clementine. A number of data transformations will also be introduced as possible solutions to skewed data. We will discuss how to create training and validation samples of the data automatically with the use of data partitioning.


1.2 Cleaning Data

In most cases, data sets contain problems or errors such as missing information and/or spurious values. Before modeling begins, these errors should be removed or at least minimized. The higher the quality of data used in data mining, the more likely it is that predictions or results are accurate. Clementine provides a number of ways to handle blank or missing information and several techniques to detect data irregularities. In this section we will briefly discuss an approach to data cleaning.

Note: If there is interest the trainer may refer to the stream Dataprep.str located in the c:\Train\ClemPredModel directory. This stream contains examples of the techniques detailed in the following paragraphs.

After the data have been read into Clementine, and if necessary all relevant data sources have been combined, the first step in data cleaning is to assess the overall quality of the data. This often involves:

• Using the Types tab of a source node or the Type node to fully instantiate data, usually achieved by clicking the Read Values button within the source or Type node, or by passing the data from a Type node into a Table node and allowing Clementine to auto-type.

• Flagging missing values (white space, null and value blanks) as blank definitions within a source node or the Type node.

• Using the Data Audit node to examine the distribution and summary statistics (minimum, maximum, mean, standard deviation, number of valid records) for data fields.

Once the condition of the data has been assessed, the next step is to attempt to improve the overall quality. This can be achieved in a variety of ways:

• Using the Generate menu from the Data Audit node’s report, a Select node that removes records with blank fields can be automatically created (particularly relevant for a model’s output field).

• Fields with a high proportion of blank records can be filtered out using the Generate menu from the Data Audit node’s report to create a Filter node.

• Blanks can be replaced with appropriate values using the Filler node. Possible appropriate values within a numeric field can range from the average, mode, or median, to a value predicted using one of the available modeling techniques. In addition, missing values can be imputed by using the Data Audit node.

• The Type node and Types tab in source nodes provide an automatic checking process that examines values within a field to determine whether they comply with the current type and bounds settings. If they do not, fields with out-of-bound values can either be modified, or those records removed from passing downstream.

After these actions are completed, the data will have been cleaned of blanks and out-of-bounds values. It may also be necessary to use the Distinct node to remove any duplicate records. Once the data file has been cleaned, you can then begin to modify it further so that it is suitable for the modeling technique(s) you plan to use.
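The same workflow can be sketched outside Clementine for readers who want a concrete reference point. The fragment below is a rough pandas analogue of the steps above (flagging blanks, auditing the fields, imputing, and removing duplicates); the file name, delimiter, and field names are assumptions based on the churn.txt description later in this chapter, not output from an actual stream.

    import pandas as pd

    # Assumed: churn.txt is blank delimited and includes field names (see the exercises).
    df = pd.read_csv("churn.txt", sep=r"\s+")

    # Treat empty or whitespace-only strings as missing ("blanks").
    df = df.replace(r"^\s*$", pd.NA, regex=True)

    # Audit: proportion of missing values and summary statistics for each field.
    print(df.isna().mean().sort_values(ascending=False))
    print(df.describe(include="all"))

    # Remove records with a blank output field, impute a numeric field with its median,
    # and drop exact duplicate records (the role of the Distinct node).
    df = df.dropna(subset=["CHURNED"])
    df["LOCAL"] = pd.to_numeric(df["LOCAL"], errors="coerce")
    df["LOCAL"] = df["LOCAL"].fillna(df["LOCAL"].median())
    df = df.drop_duplicates()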


1.3 Balancing Data

Once the data have been cleaned you should examine the distribution of the key fields you will be using in modeling, including the output field (if you are creating a predictive model). This is achieved most easily using the Data Audit node, but either the Distribution node (for symbolic data) or the Histogram node (for numeric data) will produce charts for single fields.

If the distribution of a symbolic output field is heavily skewed in favor of one of the categories, you may encounter problems when generating predictive models. For example, if only 3% of a mailing database have responded to a campaign, a neural network trained on this data might try to classify every individual as a non-responder to achieve a 97% accuracy—great but not very useful!

One solution to overcome this problem is to balance the data, which will overweight the less frequent categories. This can be accomplished with the Balance node, which works by either reducing the number of records in the more frequent categories or boosting the records in the less frequent categories. A Balance node can be generated automatically from the distribution and histogram displays.

When balancing data we recommend using the reduce option in preference to the boosting option. Boosting duplicates records and thus magnifies problems and irregularities, since a relatively small number of cases is heavily weighted. However, when working with small data sets, reducing is often not feasible and boosting is the only sensible solution to imbalances within the data.
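For comparison, here is a minimal Python/pandas sketch of what the reduce strategy accomplishes: every category of the output field is downsampled to the size of the smallest category. This illustrates the idea only; it is not the Balance node's algorithm, and the field name CHURNED anticipates the example that follows.

    import pandas as pd

    def downsample(df: pd.DataFrame, target: str, seed: int = 1) -> pd.DataFrame:
        """Reduce each category of `target` to the size of the smallest category."""
        n_min = df[target].value_counts().min()
        return (df.groupby(target, group_keys=False)
                  .apply(lambda g: g.sample(n=n_min, random_state=seed)))

    # balanced = downsample(df, "CHURNED")
    # balanced["CHURNED"].value_counts()   # roughly equal counts per category

Because the sampling is random, a fixed seed (or a cache, as noted later in this section) is needed if you want the same balanced subset on every run.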

Note

A better solution than balancing data at this stage is to sample from the original dataset(s) to create a training file with a roughly equal number of cases in each category of the output field. The test datasets should, however, match the unbalanced population proportions for this variable to provide a realistic test of the generated models.

We will illustrate data balancing by examining the distribution of the field CHURNED within the file churn.txt. This field records whether the customer is current, a voluntary leaver, or an involuntary leaver (we attempt to predict this field in the following chapters).

Open the stream Cpm1.str (located in c:\Train\ClemPredModel)
Execute the Table node and familiarize yourself with the data
Close the Table window
Connect a Distribution node to the Type node
Edit the Distribution node and set the Field: to CHURNED
Click the Execute button


Figure 1.2 Distribution of the CHURNED Field

The proportions of the three groups are rather unequal and data balancing may be useful when trying to predict this field using a neural network. This output can be used directly to create a Balance node, but first we must decide whether we wish to reduce or boost the current data. Reducing the data will drop over 73% of the records, but boosting the data will involve duplicating the involuntary leavers from 132 records to over 830. Neither of these methods is ideal but in this case we choose to reduce the data to eliminate the magnification of errors.

Click Generate…Balance Node (reduce)
Close the Distribution plot window

A generated Balance node will appear in the Stream Canvas.

Drag the Balance node to the right of the Type node and connect it between the Type and Distribution nodes

Execute the stream from the Distribution node

Figure 1.3 Distribution of the CHURNED Field after Balancing the Data


When balancing data it is advisable to enable a data cache on the Balance node to “fix” the selected sample, because the Balance node randomly reduces or boosts the data and a different sample will be selected each time the data are passed through the node. At this point the data are balanced and can be passed into a modeling node, such as the Neural Net node. Once the model has been built, it is important that the testing and assessment of the model be done on the unbalanced data.

Close the Distribution plot window

1.4 Numeric Data Transformations

When working with numeric data, the act of data balancing, as detailed above, is a rather drastic solution to the problem of skewed data and usually isn't appropriate. There are a variety of numerical transformations that provide a more sensible approach to this problem and that result in a flat or flatter distribution. The Derive node can be used to produce such transformed fields within Clementine. To determine which transformation is appropriate, we need to view the data using a histogram. We'll use the field LOCAL in this example, which measures the number of minutes of local calls per month.

Connect a Histogram node to the Type node
Edit the Histogram node and click LOCAL in the Field list
Execute the node

Figure 1.4 Histogram of the LOCAL Field


This distribution has a strong positive skew. This condition may lead to poor performance of a neural network predicting LOCAL since there is less information (fewer records) on those individuals with higher local usage. What we need is a transformation that inverts the original skew, that is, skews it to the left. If we get the transformation correct, the data will become relatively balanced. When you transform data you normally try to create a normal distribution or a uniform (flat) distribution.

For our problem, the distribution of LOCAL closely follows that of a negative exponential, e^(-x), so the inverse is a logarithmic function. We will therefore try a transformation of the form ln(x + a), where a is a constant and x is the field to be transformed. We need to add a small constant because some of the records have values of 0 for LOCAL, and the log of 0 is undefined. Typically the value of a would be the smallest actual positive value in the data.

Close the Histogram plot window
Add a Derive node and connect the Type node to it
Edit the Derive node and set the Derive Field name to LOGLOCAL
Select Formula in the Derive As list
Enter log(LOCAL + 3) in the Formula text box (or use the Expression Builder)
Click on OK

Figure 1.5 Derive Node to Create LOGLOCAL

Connect a Histogram node to the Derive node
Edit the Histogram node and set the Field to LOGLOCAL
Click the Execute button


Figure 1.6 Histogram of the Transformed LOCAL Field Using a Logarithmic Function

Although this distribution is not perfectly normal, it is a great improvement on the distribution of the original field.

Close the Histogram plot window

The above is a simple example of a transformation that can be used. Table 1.1 gives a number of other possible transformations you may wish to try when transforming data, together with their CLEM expressions.

Table 1.1 Possible Numerical Transformations

Transformation            CLEM Expression
e^x                       exp(x), where x is the name of the field to be transformed
ln(x + a)                 log(x + a), where a is a numerical constant
ln((x - a) / (b - x))     log((x - a) / (b - x)), where a and b are numerical constants
log10(x + a)              log10(x + a)
sqrt(x)                   sqrt(x)
1 / e^(mean(x) - x)       1 / exp(@GLOBAL_AVE(x) - x), where @GLOBAL_AVE(x) is the average of the field x, set using the Set Globals node in the Output palette
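As a point of comparison, the same transformations can be expressed directly in Python/NumPy. The sketch below is illustrative only: the constants a and b are placeholders you would choose from the data (for ln(x + a), typically the smallest actual positive value), and the skewness check simply indicates whether the transformation flattened the distribution.

    import numpy as np
    import pandas as pd

    def log_shift(x: pd.Series, a: float) -> pd.Series:
        return np.log(x + a)                      # ln(x + a)

    def bounded_log_ratio(x: pd.Series, a: float, b: float) -> pd.Series:
        return np.log((x - a) / (b - x))          # ln((x - a) / (b - x)), valid for a < x < b

    # x = df["LOCAL"]
    # a = x[x > 0].min()                          # smallest actual positive value
    # print(x.skew(), log_shift(x, a).skew())     # skewness before and after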


1.5 Binning Data Values

Another method of transforming a numeric field involves modifying it to create a new categorical field (type set or flag) based on the original field's values. For example, you might wish to group age into a new field based on fixed width categories of 5 or 10 years. Or, you might wish to transform income into a new field based on the percentiles (based on either the count or sum) of income (e.g., quartiles, deciles). This operation is labeled binning in Clementine, since it takes a range of data values and collapses them into one bin where they are all given the same data value.

It is certainly true that binning data loses some information compared to the original distribution. On the other hand, you often gain in clarity, and binning can overcome some data distribution problems, including skewness. Moreover, there is often interest in looking at the effect of a predictor at natural cutpoints (e.g., one standard deviation above the mean). In addition, when performing data understanding, it might be easier to view the relationship between two or more continuous variables if at least one is binned. Binning can be performed with bins based on fixed widths, percentiles, the mean and standard deviation, or ranks.

We can use the original field LOCAL to show an example of binning. We know this field is highly positively skewed, and it has many distinct values. Let's group the values into five bins by requesting binning by quintiles, and then examine the relationship of the binned field to CHURNED. The Binning node is located in the Field Ops palette.

Add a Binning node to the stream near the Type node
Connect the Type node to the Binning node
Edit the Binning node and set the Bin fields to LOCAL
Click OK
Click the Binning method dropdown and select Tiles (equal count) method
Click the Quintiles (5) check box

By default, a new field will be created from the original field name with the suffix _TILEN, where N stands for the number of bins to be created (here five). Percentiles can be based on the record count (in ascending order of the value of the bin field, which is the standard definition of percentiles), or on the sum of the field.
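A rough outside-Clementine equivalent of tile (equal count) binning is pandas' qcut, sketched below. The new field name mimics the _TILEN convention; because LOCAL contains many tied values, the boundaries may differ slightly from those the Binning node reports.

    import pandas as pd

    # Quintile (equal count) bins; duplicates="drop" guards against tied boundary values.
    binned, edges = pd.qcut(df["LOCAL"], q=5, labels=False,
                            retbins=True, duplicates="drop")
    df["LOCAL_TILE5"] = binned + 1       # bins numbered 1 through 5
    print(edges)                         # the bin thresholds (compare Figure 1.9)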


Figure 1.7 Completed Binning Node to Group LOCAL by Quintiles

The Generate tab allows you to view the bins that have been created and their upper and lower limits. Note, however, that information on the generated bins is not available until the node has been executed, since the bin thresholds must first be determined from the data.

Click OK

To study the relationship between binned LOCAL (LOCAL_TILE5) and CHURNED, we could use a Matrix node, since both fields are categorical, but we can also use a Distribution node, which will be our choice here.

Add a Distribution node to the stream and attach it to the Binning node
Edit the Distribution node and select LOCAL_TILE5 as the Field
Select CHURNED as the Overlay field
Click Normalize by color checkbox (not shown)
Click Execute


Figure 1.8 Distribution of CHURNED by Binned LOCAL

There is an interesting pattern apparent. Essentially all the involuntary churners are in the first quintile of LOCAL_TILE5 (notice how the number of cases in each category is almost exactly the same). Perhaps we got lucky when specifying quintiles as the binning technique, but we have found a clear pattern that might not have been evident if LOCAL had not been binned. We would next wish to know what the bounds are on the first quintile, and to see that we need to edit the Binning node.

Close the Distribution plot window
Edit the Binning node for LOCAL
Click the Bin Values tab
Select 5 from the Tile: menu


Figure 1.9 Bin Thresholds for LOCAL

We observe that the upper bound for Bin 1 is 10.38 minutes. That means that the involuntary churners essentially all made less than 10.38 minutes of local calls, since they all fall into this bin (quintile). Given this finding, we might decide to use the binned version of LOCAL in modeling, or try two models, one with the original field and then one with the binned version.

1.6 Data Partitioning

Models that you build (train) must be assessed with separate testing data that was not used to create the model. The training and testing data should be created randomly from the original data file. They can be created with either a Derive or Sample node, but the Partition node allows greater flexibility. With the Partition node, Clementine has the capability to directly create a field that can split records between training, testing (and validation) data files.

Partition nodes generate a partition field that splits the data into separate subsets or samples for the training and testing stages of model building. When using all three subsets, the model is built with the training data, refined with the testing data, and then tested with the validation data. The Partition node creates a field of type set with the direction automatically set to Partition. The set field will either have two values (corresponding to the training and testing files), or three values (training, testing, and validation).

Clementine model nodes have an option to enable partitioning, and they will recognize a field with direction “partition” automatically (as will the Evaluation node). When a generated model is created, predictions will be made for records in the testing (and validation) samples, in addition to the training records. Because of this capability, the use of the Partition node makes model assessment more efficient.

To illustrate the use of data partitioning, we will create a partition field for the churn data with two values, for training and testing. Although the Partition node assists in selecting records for training and testing, its output is a new field, and so it can be found in the Field Ops palette.

Add a Partition node to the stream and connect the Type node to it
Edit the Partition node

The name of the partition field is specified in the Partition field text box. The Partitions choice allows you to create a new field with either 2 or 3 values, depending on whether you wish to create 2 or 3 data samples. The size of the files is specified in the partition size text boxes. Size is relative and given in percents (which do not have to add to 100%). If the sum of the partition sizes is less than 100%, the records not (randomly) included in a partition will be discarded. The Generate menu allows you to create Select nodes that will select records in the training, testing, and validation samples.

We'll change the size of the training and testing partitions, and input a random seed so our results are comparable.

Figure 1.10 Partition Node Settings


Change the Training partition size: to 70
Change the Testing partition size: to 30
Change the Seed value to 999
Click OK
Attach a Distribution node to the Partition node
Edit the Distribution node and select Partition in the Field list
Execute the Distribution node

Figure 1.11 Distribution of the Partition Field

The new field Partition has close to a 70/30 distribution. It can now be used directly in modeling as described above, or separate files can be created with the use of a Select node. We will use the partition field in a later chapter, so we'll save the stream.
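To make the mechanics concrete, the fragment below sketches what the Partition node produces: a new field that randomly assigns each record to a training or testing partition using a fixed seed. The 70/30 split and the seed of 999 mirror the example above; the label strings are illustrative rather than Clementine's exact values.

    import numpy as np

    rng = np.random.default_rng(999)                       # fixed seed, as in the example
    draws = rng.random(len(df))
    df["Partition"] = np.where(draws < 0.70, "1_Training", "2_Testing")
    print(df["Partition"].value_counts(normalize=True))    # close to a 70/30 split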

Click on File…Save Stream As
Save the stream with the name Chapter1_Partition

1.7 Anomaly Detection

Data mining usually involves very large data files, sometimes with millions of records. In such situations, we may not be concerned about whether some records are odd or unusual based on how they compare to the bulk of records in the file. Odd cases, unless they are relatively frequent (and then they can hardly be labeled “unusual”), will not cause problems for most algorithms when we try to predict some outcome. For those of us with smaller data files, though, anomalous records can be a concern, as they can distort the outcomes of a modeling process. The most salient example of this comes from classical statistics, where regression, and other methods that fall under the rubric of the General Linear Model, can be strongly affected by outliers and deviant points.

Clementine includes an Anomaly node that searches for unusual cases in an automatic manner. Anomaly detection is an exploratory method designed for quick detection of unusual cases or records that should be candidates for further analysis. These should be regarded as suspected anomalies, which, on closer examination, may or may not turn out to be real. You may find that a record is perfectly valid but choose to screen it from the data for purposes of model building. Alternatively, if the algorithm repeatedly turns up false anomalies, this may point to an error in the data collection process.

The procedure is based on clustering the data using a set of user-specified fields. A case that is deviant compared to the norms (distributions) of all the cases in that cluster is deemed anomalous.


The procedure helps you quickly detect unusual cases during data exploration before you begin modeling. It is important to note that the definition of an anomalous case is statistical and not particular to any specific industry or application, such as fraud in the finance or insurance industry (although it is possible that the technique might find such cases). Clustering is done using the TwoStep cluster routine (also available in the TwoStep node). In addition to clustering, the Anomaly node scores each case to identify its cluster group, creates an anomaly index to measure how unusual the case is, and identifies which variables contribute most to its anomalous nature.

We'll use a new data file to demonstrate the Anomaly node's operation. The file, customer_dbase.sav, is a richer data file that is also from a telecommunications company. It has an outcome field churn which measures whether a customer switched providers in the last month. There is no target field for anomaly detection, but in most instances you will want to use the same set of variables in the Anomaly node that you plan to use for modeling. There is an existing stream file we can use for this example. The Anomaly node is found in the Modeling palette since it uses the TwoStep clustering routine.
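The underlying logic can be sketched in scikit-learn terms: cluster the chosen fields, then score each record by how far it deviates from its own cluster's norms. Clementine uses TwoStep clustering and its own index definition; k-means and a z-score-based index are stand-ins here, and numeric_fields is a placeholder for the account fields selected below.

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def anomaly_scores(X: pd.DataFrame, n_clusters: int = 3, seed: int = 1):
        """Cluster the records, then score each by its average deviation from cluster norms."""
        Z = StandardScaler().fit_transform(X)
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
        scores = np.zeros(len(X))
        for k in range(n_clusters):
            members = Z[labels == k]
            dev = np.abs(members - members.mean(axis=0)) / (members.std(axis=0) + 1e-9)
            scores[labels == k] = dev.mean(axis=1)    # average deviation across fields
        return labels, scores

    # labels, index = anomaly_scores(df[numeric_fields])
    # flagged = index > np.quantile(index, 0.99)      # flag roughly 1% of records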

Click File…Open Stream
Double-click on Anomaly_FeatureSelect.str in the c:\Train\ClemPredModel directory
Execute the Table node and view the data
Close the Table window
Place an Anomaly node in the stream and connect it to the Type node
Edit the Anomaly node, and then click the Fields tab

Figure 1.12 Anomaly Node Fields Tab

You will typically specify exactly which fields should be used to search for anomalous cases. In these data, there are several fields that measure various aspects of the customer’s account, and we want to use all these here (there are also demographic fields, but in the interests of keeping this example relatively simple, we will restrict somewhat the number and type of fields used).


Click the Use custom settings button
Click the Field chooser button, and select all the fields from longmon to ebill (they are contiguous)
Click OK
Click the Model tab

Figure 1.13 Anomaly Node Model Settings

By default, the procedure will use a cutoff value that flags 1% of the records in the data. The cutoff is included as a parameter in the model being built, so this option determines how the cutoff value is set for modeling but not the actual percentage of records to be flagged during scoring. Actual scoring results may vary depending on the data. The Number of anomaly fields to report specifies the number of fields to report as an indication of why a particular record is flagged as an anomaly. The most anomalous fields are defined as those that show the greatest deviation from the field norm for the cluster to which the record is assigned. We’ll use the defaults for this example.

Click Execute
Right-click on the Anomaly model in the Models Manager, and select Browse
Click the Expand All button


Figure 1.14 Browsing Anomaly Generated Model Results

We see that three clusters (labeled “Peer Groups”) were created automatically (although we didn't view the Expert options, the default number of clusters to be created is set between 1 and 15). In the first cluster there are 1267 records, and 18 have been flagged as anomalies (about 1.4%, close to the 1% cutoff value).

The Model browser window doesn't tell us which cases are anomalous in this cluster, but it does provide a list of fields that contributed to defining one or more cases as anomalous. Of the 18 records identified by the procedure, 16 are anomalous on the field lnwireten (the log of wireless usage over tenure in months [time as a customer]). This was a derived field created earlier in the data exploration process. The average contribution to the anomaly index from lnwireten is .275. This value should be used in a relative sense in comparison to the other fields.

To see information for specific records we need to add the generated Anomaly model to the stream. We will sort the records by the $O-AnomalyIndex field, which contains the index values.

Add the Anomaly generated model node to the stream near the Type node and connect the two
Add a Sort node from the Record Ops palette to the stream and connect the Anomaly generated model node to the Sort node
Edit the Sort node and select the field $O-AnomalyIndex as the sort field
Change the Sort Order to Descending


Figure 1.15 Sorting Records by Anomaly Index

Click OK
Connect a Table node to the Sort node
Execute the Table node

Figure 1.16 Records Sorted by Anomaly Index with Fields Generated by Anomaly Model

For each record, the model creates 9 new fields. The field $O-PeerGroup contains the cluster membership. The next six fields contain the top three fields that contributed to this record being an anomaly and the contribution of each field to the anomaly index (we can request fewer or more fields on which to report in the Anomaly node Model tab). Thus we see that the three most anomalous cases, with an anomaly index of 5.0, are all in cluster 2. The first two of these are most deviant on longmon and longten.


Knowing which variables made the greatest contribution to the anomaly index allows you to more easily review the data values for these cases. You don't need to look at all the fields, but instead can concentrate on specific fields detected by the model for that case. In the interests of time, we won't take this next step here, but you might want to try this in the exercises. What we can briefly show are the options available when an Anomaly generated model is added to the stream.

Close the Table window
Edit the Anomaly generated model node in the stream
Click on the Settings tab

Figure 1.17 Settings Tab Options for Anomaly Generated Models

Note in particular that in large files, there is an option available to discard non-anomalous records, which will make investigating the anomalous records much easier. Also, you can change the number of fields on which to report here.

Close the Anomaly model Browser window

1.8 Feature Selection for Models

Just as data files can have many records in data-mining problems, there are often hundreds, or thousands, of potential fields that can be used as predictors. Although some models can naturally use many fields—decision trees, for example—others cannot or are inefficient, at best, with too many fields. As a result, you may have to spend an inordinate amount of time to examine the fields to decide which ones should be included in a modeling effort.


To shortcut this process and narrow the list of candidate predictors, the Feature Selection node can identify the fields that are most important—most highly related—to a particular target/outcome field. Reducing the number of fields required for modeling will allow you to develop models more quickly, but also permit you to explore the data more efficiently.

Feature selection has three steps:

1) Screening: In this first step, fields are removed that have too much missing data, too little variation, or too many categories, among other criteria. Also, records are removed with excessive missing data.

2) Ranking: In the second step, each predictor is paired with the target and an appropriate test of the bivariate relationship between the two is performed. This can be a crosstabulation for categorical variables or a Pearson correlation coefficient if both variables are continuous. The probability values from these bivariate analyses are turned into an importance measure by subtracting the p value of the test from 1 (thus a low p value leads to an importance near 1). The predictors are then ranked on importance (a rough sketch of this calculation appears after this list).

3) Selecting: In the final step, a subset of predictors is identified to use in modeling. The number of predictors can be identified automatically by the model, or you can request a specific number.
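The ranking step can be illustrated with a short Python/SciPy sketch: test each candidate predictor against the target and define importance as 1 minus the p value. The choice of tests and the crude categorical/continuous check are simplifications of what Clementine actually does; churn and the candidates list are placeholders for the example that follows.

    import pandas as pd
    from scipy import stats

    def importance(df: pd.DataFrame, predictor: str, target: str) -> float:
        """Importance = 1 - p value of a simple bivariate test against the target."""
        x = df[predictor]
        if x.dtype == object or x.nunique() < 10:             # crude "categorical" check
            _, p, _, _ = stats.chi2_contingency(pd.crosstab(x, df[target]))
        else:                                                  # continuous predictor
            _, p = stats.pearsonr(x, df[target])
        return 1.0 - p

    # ranked = sorted(candidates, key=lambda f: importance(df, f, "churn"), reverse=True)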

Feature selection is also located in the Modeling palette and creates a generated model node. This node, though, does not add predictions or other derived fields to the stream. Instead, it acts as a filter node, removing unnecessary fields downstream (with parameters under user control). We’ll try feature selection on the customer database file. Note that although we are using feature selection after demonstrating anomaly detection, you may want to use these two in combination. For example, you can first use feature selection to identify important fields. Then you can use anomaly detection to find unusual cases on only those fields.

Add a Feature Selection node to the stream and connect it to the Type node
Edit the Feature Selection node and click the Fields tab
Click the Use custom settings button
Select churn as the Target field (not shown)
Select all the fields from region to news as Inputs (be careful not to select churn again)
Click the Model tab


Figure 1.18 Model Tab for Feature Selection to Predict Churn

By default fields will initially be screened based on the various criteria listed in the Model tab. A field can have no more than 70% missing data (which is rather generous, and you may wish to modify this value). There can be no more than 90% of the records with the same value, and the minimum coefficient of variation (standard deviation/mean) is 0.1. All of these are fairly liberal standards.
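These screening rules are easy to express outside Clementine. The function below is a hedged sketch that applies the three default thresholds to a single field; it mirrors the criteria described above rather than the node's exact implementation.

    import pandas as pd

    def passes_screen(x: pd.Series) -> bool:
        if x.isna().mean() > 0.70:                          # more than 70% missing
            return False
        top_share = x.value_counts(normalize=True, dropna=True).iloc[0]
        if top_share > 0.90:                                # one value covers more than 90% of records
            return False
        if pd.api.types.is_numeric_dtype(x):
            mean = x.mean()
            if mean != 0 and abs(x.std() / mean) < 0.1:     # coefficient of variation below 0.1
                return False
        return True

    # kept = [f for f in candidate_fields if passes_screen(df[f])]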

Click the Options tab

Figure 1.19 Options for Feature Selection


After being ranked, fields will be selected based on importance, and only those deemed Important will be retained in the model. This can be changed to select the top N fields by ranking of importance, or to select all fields that meet a minimum level of importance. Four options are available for determining the importance of categorical predictors, with the default being the Pearson chi-square value. We will use all default settings for these data.

Click Execute
Right-click on the churn Feature Selection generated model and select Browse

Figure 1.20 Feature Selection Browser Window

We selected 127 potential predictors. Seven were rejected in the screening stage because of too much missing data or too little variation. Then of the remaining 120 fields, the model selected 63 as being important, so it has reduced our tasks of data review and model building considerably.


The model ranked the fields by importance (importance is rounded off to a maximum value of 1.000). If you scroll down the list of fields in the upper pane, you will eventually see fields with low values of importance that are unrelated to churn. All variables with their box checked will be passed downstream if this node is added to a data stream. The set of important variables includes a mix, with some demographic (age, employ), account-related (tenure, ebill), and financial status (cardtenure) types. From here, we can add the generated Feature Selection model to the stream, and it will filter out the unimportant variables.

Note

When using the Feature Selection node, it is important to understand its limitations. First, importance of a relationship is not the same thing as the strength of a relationship. In data mining, the large data files used allow very weak relationships to be statistically significant. So just because a variable has an importance value near 1 does not guarantee that it will be a good predictor of some target variable. Second, nonlinear relationships will not necessarily be detected by the tests used in the Feature Selection node, so a field could be rejected yet have the potential of being a good predictor (this is especially true for continuous predictors).
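The second caveat is easy to demonstrate with synthetic data: a strong but U-shaped relationship can produce a Pearson correlation, and hence an importance value, that suggests no relationship at all, so a purely linear screen would discard a genuinely useful predictor. The example below is a toy illustration, not taken from the course data.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=5000)
    y = x**2 + rng.normal(scale=0.1, size=x.size)   # strong, but entirely nonlinear, dependence

    r, p = stats.pearsonr(x, y)
    print(f"r = {r:.3f}, p = {p:.3f}")              # r is close to 0: the linear test misses the pattern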


Summary Exercises

A Note Concerning Data Files

In this training guide, files are assumed to be located in the c:\Train\ClemPredModel directory.

Exercise Data Information Sheets

The exercises in this chapter are written around the data file churn.txt. The following section gives details of the file.

churn.txt contains information from a telecommunications company. The data consist of customers who at some point have purchased a mobile phone. The primary interest of the company is to understand which customers will remain with the organization or leave for another company. The file contains the following fields:

ID                    Customer reference number
LONGDIST              Time spent on long distance calls per month
International         Time spent on international calls per month
LOCAL                 Time spent on local calls per month
DROPPED               Number of dropped calls
PAY_MTHD              Payment method of the monthly telephone bill
LocalBillType         Tariff for locally based calls
LongDistanceBillType  Tariff for long distance calls
AGE                   Age
SEX                   Gender
STATUS                Marital status
CHILDREN              Number of children
Est_Income            Estimated income
Car_Owner             Car owner
CHURNED (3 categories):
    Current – Still with company
    Vol – Leavers who the company wants to keep
    Invol – Leavers who the company doesn't want

In this session we will perform some exploratory analysis on the churn.txt data file and prepare these data so that they are ready for modeling.

1. Read the file c:\Train\ClemPredModel\Churn.txt—this file is blank delimited and includes field names—using a Var. File node. Browse the data and familiarize yourself with the data structure within each field.

2. Check to see if there are blanks (missing values) within the data; if you find any problems, decide how you wish to deal with these and take appropriate steps.

3. Look at the distribution of the CHURNED field. This field probably requires balancing. Do you think it is better to balance by “boosting” or “reducing” the data?


4. If you think that both of these methods are too harsh (either in terms of duplicating data too much or reducing data so there are too few cases), edit the balance node and see if you can find a way of reducing the impact of balancing.

5. If you are going to use this data for modeling, do you wish to cache this node?

6. Use the Data Audit node to look at the distribution of some of the fields that will be used as inputs. Does the distribution of these fields appear appropriate? If not, try and find a transformation that may help the modeling process. (Note: The instructor may have already spoken about the field LOCAL—you may want to transform this field, as discussed in Chapter 1).

7. Look at the field International. Do you think this field will need transforming or binning? Can you find a transformation that helps with this field? If not, why do you think this is?

8. Think about whether there are potentially any other fields that could be derived from existing data that may help out with the modeling process. If so, create those fields.

9. Try using the Anomaly node on these data to detect unusual records. Don't use the field CHURNED. Do you find any commonalities among most of the anomalous records?

10. If you have made any data transformations, balanced the data, or derived any fields, you may want to create a Supernode that reduces the size of your current stream.

11. Save your stream as Exer1.str.

For those with extra time: Use the Anomaly node to detect anomalous cases in the customer_dbase.sav file, as we did in the chapter. Then add the generated Anomaly node to the stream and investigate these unusual cases in more detail. Would you retain them for modeling, or not? Why?


Chapter 2: Neural Networks

Overview

• Introduce the Neural Net node
• Build a neural network
• Introduce the Generated Models palette
• Browse and interpret the results
• Evaluate the model
• Overview the different types of neural network training methods available within Clementine
• Illustrate how and when to use the expert options within the Neural Net node
• Discuss the use of sensitivity analysis and prevention of over-training
• Review how missing values are handled by the Neural Net node

Objectives
In this chapter we show how to build a neural network with Clementine. The resulting model will be browsed and the output explained. In addition, we will introduce the different training methods and discuss the types of algorithms available within Clementine. We will then illustrate how and when to use the expert options within the Neural Net node. Finally, we will discuss the uses of sensitivity analysis and how to prevent over-training.

Data
In this chapter we will use the data set introduced in the previous chapter, churn.txt. The data contain information on 1477 of the company’s customers who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers and voluntary leavers. In this chapter we want to use data mining to understand what factors influence whether an individual remains as a customer or leaves for a competitor. The file contains information including length of time spent on local, long distance and international calls, the type of billing scheme and a variety of basic demographics, such as age and gender of the customer.

Following recommended practice, we will use a Partition node to divide the cases into two partitions (subsamples), one to build or train the model and the other to test the model (often called a holdout sample). With a holdout sample, you are able to check the resulting model performance on data not used to fit the model. The holdout data sample also has known values for the outcome field and therefore can be used to check model performance.
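Although we will build this split with the Partition node, the underlying idea can be sketched outside Clementine. The Python sketch below (assuming the pandas and numpy libraries, and that the file is comma-delimited with a header row, as in the stream built later in this chapter) is illustrative only, not part of the Clementine stream:

    # Minimal sketch of a 50/50 train/holdout partition, assuming pandas and numpy.
    import numpy as np
    import pandas as pd

    churn = pd.read_csv("c:/Train/ClemPredModel/Churn.txt")   # comma-delimited, field names in first row
    rng = np.random.default_rng(seed=233)                     # fixed seed so the split is reproducible
    in_training = rng.random(len(churn)) < 0.5                # roughly half of the records
    training = churn[in_training]
    testing = churn[~in_training]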

2.1 Introduction to Neural Networks
Historically, neural networks attempted to solve problems using methods modeled on how the brain operates. Today they are generally viewed as powerful modeling techniques.


A typical neural network consists of several neurons arranged in layers to create a network. Each neuron can be thought of as a processing element that is given a simple part of a task. The connections between the neurons provide the network with the ability to learn patterns and interrelationships in data. The figure below gives a simple representation of a common neural network (a Multi-Layer Perceptron).

Figure 2.1 Simple Representation of a Common Neural Network

When using neural networks to perform predictive modeling, the input layer contains all of the fields used to predict the outcome. The output layer contains an output field: the target of the prediction. The input and output fields can be numeric or symbolic (in Clementine, symbolic fields are transformed into a numeric form (dummy or binary set encoding) before processing by the network). The hidden layer contains a number of neurons at which outputs from the previous layer combine. A network can have any number of hidden layers, although these are usually kept to a minimum. All neurons in one layer of the network are connected to all neurons within the next layer.

While the neural network is learning the relationships between the data and results, it is said to be training. Once fully trained, the network can be given new, unseen data and can make a decision or prediction based upon its experience. When trying to understand how a neural network learns, think of how a parent teaches a child how to read. Patterns of letters are presented to the child and the child makes an attempt at the word. If the child is correct she is rewarded, and the next time she sees the same combination of letters she is likely to remember the correct response. However, if she is incorrect, then she is told the correct response and tries to adjust her response based on this feedback. Neural networks work in the same way.

Clementine provides two different classes of supervised neural networks, the Multi-Layer Perceptron (MLP) and the Radial Basis Function Network (RBFN). In this course we will concentrate on the MLP type of network; the reader is referred to the Clementine 11.1 Node Reference for more details on the RBFN approach to neural networks.

Within a Multi-Layer Perceptron (MLP), each hidden layer neuron receives an input based on a weighted combination of the outputs of the neurons in the previous layer. The neurons within the final hidden layer are, in turn, combined to produce an output. This predicted value is then compared to the correct output and the difference between the two values (the error) is fed back into the network, which in turn is updated. This feeding of the error back through the network is referred to as back-propagation.

To illustrate this process we will take the simple example of a child learning the difference between an apple and a pear. The child may decide that the most useful factors in making a decision are the shape, the color and the size of the fruit—these are the inputs. When shown the first example of a fruit she may look at the fruit and decide that it is round, red in color and of a particular size. Not knowing what an apple or a pear actually looks like, the child may decide to place equal importance on each of these factors—the importance is what a network refers to as weights. At this stage the child is most likely to randomly choose either an apple or a pear for her prediction. On being told the correct response, the child will increase or decrease the relative importance of each of the factors to improve her decision (reduce the error). In a similar fashion a MLP network begins with random weights placed on each of the inputs. On being told the correct response, the network adjusts these internal weights. In time, the child and the network will hopefully make correct predictions.
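The arithmetic behind this description can be made concrete with a small sketch. The Python code below (using numpy; the layer sizes, learning rate and data are made up for illustration and do not reproduce Clementine's internal implementation) performs one forward pass through a one-hidden-layer MLP and one back-propagation update of its weights:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out = 15, 3, 3                      # illustrative layer sizes
    W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))     # input-to-hidden weights, start random
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))    # hidden-to-output weights, start random
    eta = 0.3                                             # learning rate (see the Eta discussion later)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = rng.random(n_in)                                  # one hypothetical input record
    target = np.array([1.0, 0.0, 0.0])                    # dummy-coded output, e.g. CHURNED = Current

    # Forward pass: weighted combinations flow from inputs through the hidden layer to the output
    hidden = sigmoid(x @ W1)
    output = sigmoid(hidden @ W2)

    # Back-propagation: the error is fed back through the network and the weights are updated
    error_out = (target - output) * output * (1.0 - output)
    error_hidden = (error_out @ W2.T) * hidden * (1.0 - hidden)
    W2 += eta * np.outer(hidden, error_out)
    W1 += eta * np.outer(x, error_hidden)

Repeating these two steps over many records, and over many passes through the data, is what the guide refers to as training.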

2.2 The Neural Network Node
The Neural Net node is used to create a neural network and can be found in the Modeling palette. Once trained, a Generated Neural Net node labeled with the name of the predicted field will appear in the Generated Models palette. This node represents the trained neural network. Its properties can be browsed and new data can be passed through this node to generate predictions. We will investigate the properties of the trained network node later in this chapter.

Before a data stream can be used by the Neural Net—or any node in the Modeling palette—the types of all fields used in the model must be defined (either in the source node or a Type node). That is because all modeling nodes use this information to set up the models. As a reminder, the table below shows the five available directions for a field.

Table 2.1 Direction Settings

IN – The field acts as an input or predictor within the modeling.
OUT – The field is the output or target for the modeling.
BOTH – Allows the field to act as both an input and an output in modeling. Suitable for the association rule and sequence detection algorithms only; all other modeling techniques will ignore the field.
NONE – The field will not be used in machine learning or statistical modeling. Default if the field is defined as Typeless.
PARTITION – Indicates a field used to partition the data into separate samples for training, testing, and (optional) validation purposes.

Direction can be set by clicking in the Direction column for a field within the Type node or the Type tab of a source node and selecting the direction from the drop-down menu. Alternatively, this can be done from the Fields tab of a modeling node.

If the Stream Canvas is not empty, click File…New Stream
Place a Var. File node from the Sources palette


Double-click the Var. File node
Move to the c:\Train\ClemPredModel directory and double-click on the Churn.txt file
Click, if not already checked, the Read field names from file check box
As delimiter, check the Comma option
Set the Strip lead and trail spaces: option to Both
Click OK to return to the Stream Canvas
Place a Partition node from the Field Ops palette to the right of the Var. File node named Churn.txt
Connect the Var. File node named Churn.txt to the Partition node
Place a Type node from the Field Ops palette to the right of the Partition node
Connect the Partition node to the Type node

Next we will add a Table node to the stream. This not only will force Clementine to instantiate the data but also will act as a check to ensure that the data file is being correctly read.

Place a Table node from the Output palette above the Type node in the Stream Canvas
Connect the Type node to the Table node
Right-click the Table node
Execute the Table node

The values in the data table look reasonable (not shown).

Click File…Close to close the Table window
Double-click the Type node
Click in the cell located in the Type column for ID (current value is Range), and select Typeless from the list
Click in the cell located in the Direction column for CHURNED (current value is In) and select Out from the list


Figure 2.2 Type Node Ready for Modeling

Notice that ID will be excluded from any modeling as the direction is automatically set to None for a Typeless field. The CHURNED field will be the output field for any predictive model and all fields but ID and Partition will be used as predictors.

Click OK
Place a Neural Net node from the Modeling palette to the right of the Type node
Connect the Type node to the Neural Net node

Figure 2.3 Neural Net Node (CHURNED) Added to Data Stream


Notice that once the Neural Net node is added to the data stream, its name becomes CHURNED, the field we wish to predict. The name can be changed, among other things, by editing the Neural Net node.

Double-click the Neural Net node

Figure 2.4 Neural Net Dialog: Model Tab

The name of the network, which by default will also be used as the name for the Neural Net node and the generated model node, can be entered in the Model name Custom text box.

The Use partitioned data option is checked so that the Neural Net node will only use the Training cases on the Partition field to build the model and hold out the Testing cases for testing purposes. If this option is left unchecked, this node will ignore the Partition field and use all of the cases to build the model.

There are six different algorithms available within the Neural Net node. The Quick method uses a feed-forward back-propagation network whose topology (number and configuration of nodes in the hidden layer) is based on the number and types of the input and output fields. For details on the other neural network methods, the reader is referred to the Clementine 11.1 Node Reference.

Over-training is one of the problems that can occur within neural networks. As the data pass repeatedly through the network, it is possible for the network to learn patterns that exist in the sample only and thus over-train. That is, it will become too specific to the training sample data and lose its ability to generalize. By selecting the Prevent overtraining option (checked by default), only a randomly selected proportion of the training data is used to train the network (this is separate from the holdout sample discussed above). By default, 50% of the data is selected for training the model, and 50% for testing it. Once the training proportion of data has made a complete pass through the network, the rest is reserved as a test set to evaluate the performance of the current network. By default, this information determines when to stop training and provides feedback information. We advise you to leave this option turned on. Note that by checking both the Use partitioned data and the Prevent overtraining options, the Neural Net model will be trained on 50 percent of the training sample selected by the Partition node, and not on half of the entire data set.

You can control how Clementine decides to stop training a network. By default, Clementine stops when it appears to have reached its optimally trained state; that is, when accuracy in the test data set seems to no longer improve. Alternatively, you can set a required accuracy value, a limit to the number of cycles through the data, or a time limit in minutes. In this chapter we use the default option.

Since the neural network initiates itself with random weights, the behavior of the network can be reproduced using the Set random seed option and entering the same seed value. Although we do it here to reproduce the results in the guide, setting the random seed is not a normal practice, and it is advisable to run several trials on a neural network to ensure that you obtain similar results using different random starting points.

The Optimize option allows you to make a tradeoff between speed and memory usage. Select Speed to never have Neural Net use disk space for memory in order to improve performance. Alternatively, select Memory to use available disk space when appropriate, at some sacrifice to speed. By default, optimize for memory is selected, and we recommend leaving it at this setting unless your computer is low in installed memory.

The Options tab allows you to customize some settings:

Click the Options tab


Figure 2.5 Neural Net Options Tab

The Use binary set encoding option uses an alternative method of coding fields of type Set when they are used in the Neural Net node. It is more efficient and thus can have benefits when Set fields included in the model have a large number of values.

By default, a feedback graph appears while the network is training and gives information on the current accuracy of the network. We will describe the feedback graph in more detail later.

By default, a model will be generated from the best network found (based on the test data), but there is an option to generate a model from the final network trained. This can be used if you wish to stop the network at different points to examine intermediate results, and then pick up network training from where it left off (you will need to check the Continue training existing model option as well).

Sensitivity analysis provides a measure of relative importance for each of the fields used as inputs to the network and is helpful in evaluating the predictors. We will retain this option as well.

The Expert tab allows you to refine the properties (for example, the network topology and training parameters) of the training method. Expert options are detailed in the Clementine 11.1 Node Reference. Initially, we will keep the default settings on the above options.

The Fields tab can be used to override the Type node direction settings and directly select the predictors and outcome field for the neural net.


To reproduce the results in this training guide:

Click the Model tab
Click the Set random seed check box
Type 233 into the Seed: text box
Click Execute

Note that if different models are built from the same stream using different inputs, it may be advisable to change the Neural Net node names for clarity.

Figure 2.6 Feedback Graph During Network Training

Clementine passes the data stream to the Neural Net node and begins to train the network. A feedback graph similar to the one shown above appears on the screen (although it may not appear for this small data file if your computer is reasonably fast). The graph contains two lines. The red, more irregular line, labeled Current Predicted Accuracy, presents the accuracy of the current network in predicting the test data set. The blue, smoother line, labeled Best Predicted Accuracy, displays the best accuracy so far on the test data. Training can be paused by clicking the Stop execution button in the Toolbar (this button can be found next to the Execute buttons). Once trained, the network performs the sensitivity analysis, if requested, and a diamond-shaped node with the neural net symbol appears in the Models palette. This represents the trained network and is labeled with the output field name.


Figure 2.7 Generated Neural Net Node Appearing in Models Palette

2.3 Models Palette
The Models tab in the Manager holds and manages the results of the machine learning and statistical modeling operations. There are two context menus available within the Models palette. The first menu applies to the entire model palette.

Right-click in the background (empty area) in the Models palette

Figure 2.8 Context Menu in the Models Palette


This menu allows you to open a model in the palette, save the models palette and its contents, open a previously saved models palette, clear the contents of the palette, or add the generated models to the Modeling section of the CRISP-DM project window. If you use SPSS Predictive Enterprise Services to manage and run your data mining projects, you can store the palette in, or retrieve a palette or model from, the Predictive Enterprise Repository. The second menu is specific to the generated model nodes.

Right-click the generated Neural Net node named CHURNED in the Models palette

Figure 2.9 Context Menu for Nodes in the Models Palette

This menu allows you to rename, annotate, and browse the generated model node. A generated model node can be deleted, exported in PMML (Predictive Model Markup Language) format, stored in the Predictive Enterprise Repository, or saved in a file for future use. We will first browse the model.

Click Browse

For more information on a section simply expand the section by double-clicking the section (or click the Expand All button to expand all sections at once). To start with we will take a closer look at the Analysis section.

Expand the Relative Importance of Inputs folder

The output is shown in Figure 2.10. The Analysis section displays information about the neural network. The predicted accuracy for this neural network is 75.2%, indicating the proportion of the test set (used to prevent overtraining) correctly predicted. The input layer is made up of one neuron per numeric or flag type field. Set type fields will have one neuron per value within the set (unless binary encoding is used). In this example, there are twelve numeric or flag fields and one set field with three values, totaling fifteen neurons. In this network there is one hidden layer, containing three neurons, and the output layer contains three neurons corresponding to the three values of the output field, CHURNED. If the output field had been defined as numeric then the output layer would only contain one neuron.
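The neuron counts follow directly from how the fields are coded; a small sketch of the arithmetic (the counts are those quoted above, not derived from the data):

    # Input neurons: one per numeric or flag field, plus one per value of each set field
    # (when binary set encoding is not used). Counts below are those quoted in the text.
    numeric_or_flag_fields = 12
    set_field_value_counts = [3]          # one set input field with three values
    input_neurons = numeric_or_flag_fields + sum(set_field_value_counts)
    output_neurons = 3                    # one per value of the symbolic output field CHURNED
    print(input_neurons, output_neurons)  # 15 and 3, matching the network described above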


The input fields are listed in descending order of relative importance. Importance values can range from 0.0 to 1.0, where 0.0 indicates unimportant and 1.0 indicates extremely important. In practice this figure rarely goes above 0.40 or so. Here we see that International is the most important field within this current network, followed by LONGDIST and SEX.

Figure 2.10 Browsing the Generated Net Node

The sections Fields, Build Settings and Training Summary contain technical details and we will skip these sections.

Click File…Close to close the CHURNED Neural Net output window

2.4 Validating the List of Predictors
Because most data mining techniques have no associated statistics with which to validate the model, it is important to rerun the model with a different seed to be sure that the results are consistent. It is entirely possible that, because of the seed we chose, one or more of the fields the neural network found to be important in influencing churn might not be selected again with a different seed. Therefore, it is crucial to run the neural network model enough times until you are convinced about which predictors are the most important in influencing your outcome. We will rerun the model just once and compare it with the one we just ran. Normally, you would need to rerun it several times.

Double-click the Neural Net node
Change the random seed in the Seed: text box from 233 to 444
Click Execute
Right-click the generated Neural Net node named CHURNED in the Models palette
Click Browse
Expand the Relative Importance of Inputs folder

Figure 2.11 Browsing the Generated Net Node after Changing the Seed

While the same three fields as in the previous model were chosen as the most important, notice that the second model ranked SEX as more important than LONGDIST. Also note that the accuracy has jumped from 75.2 to 80.7 percent. Normally we would rerun the model again to further convince ourselves that these are indeed the top three predictors of CHURNED, but we will stop here and attempt to further understand the model.

Click File…Close to close the CHURNED Neural Net output window

2.5 Understanding the Neural Network
A common criticism of neural networks is that they are opaque; that is, once built, the reasoning behind their predictions is not clear. For instance, does making a lot of international calls mean that you are likely to remain a customer, or to leave? In the following sections we will use some techniques available in Clementine to help you evaluate the network and discover its simple structure.

Creating a Data Table Containing Predicted Values
Generated model nodes can be placed on the Stream Canvas and treated in the same way as the other operational nodes within Clementine. The data can be passed through them and they perform an operation (adding model-based fields to the stream).

Move the Neural Net node named CHURNED higher in the Stream Canvas
Place the generated Neural Net model named CHURNED from the Models palette to the right of the Type node on the Stream Canvas
Connect the Type node to the generated Neural Net model named CHURNED
Place a Table node below the generated Neural Net model named CHURNED
Connect the generated Neural Net model named CHURNED to the Table node

Figure 2.12 Placing a Generated Model on the Stream Canvas

Execute the new Table node


Figure 2.13 Table Showing the Two Fields Created by the Generated Net Node

The generated Neural Net node calculates two new fields, $N-CHURNED and $NC-CHURNED, for every record in the data file. The first represents the predicted CHURNED value and the second a confidence value for the prediction. The latter is only appropriate for symbolic outputs and will be in the range of 0.0 to 1.0, with the more confident predictions having values closer to 1.0.
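How such fields can be derived is easy to see in outline. The sketch below is illustrative only (the scores are made up, and the exact confidence calculation Clementine uses is not documented here): it picks the highest-scoring category as the prediction and uses that score as the confidence.

    # Illustrative only: derive a predicted category and a 0-1 confidence for one record
    # from hypothetical output-layer scores (one score per CHURNED category).
    categories = ["Current", "Vol", "Invol"]
    scores = [0.71, 0.22, 0.07]                       # made-up values

    best = max(range(len(scores)), key=lambda i: scores[i])
    predicted_churned = categories[best]              # plays the role of $N-CHURNED
    confidence = scores[best]                         # plays the role of $NC-CHURNED
    print(predicted_churned, confidence)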

Close the Table window

Comparing Predicted to Actual Values
When predicting a symbolic field, it is valuable to produce a data matrix of the predicted values ($N-CHURNED) and the actual values (CHURNED) in order to study how they compare and where the differences are. In data mining projects it is advisable to see not only how well the model performed with the data we used to train the model, but also with the data we held out for testing purposes. Because the Matrix node does not have an option to automatically split the results by partition, we must manually divide the Training and Testing samples with Select nodes. This will allow us to create a separate matrix table for each sample.
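For readers who want to reproduce such a matrix outside Clementine, the row-percentage crosstabulation can be sketched as follows (assuming pandas, a DataFrame for one partition, and a hypothetical column pred_churned holding the model's predictions):

    import pandas as pd

    # Row-percentage matrix of actual versus predicted CHURNED values for one partition.
    # 'training' is a DataFrame for the training sample; 'pred_churned' is a hypothetical
    # column name for the model's predictions.
    matrix = pd.crosstab(training["CHURNED"], training["pred_churned"], normalize="index") * 100
    print(matrix.round(1))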

Place two Select nodes from the Record Ops palette, one to the lower right of the generated Neural Net model named CHURNED and one to the lower left
Connect the generated Neural Net model named CHURNED to each Select node

First we will edit the Select node on the left that we will use to select the Training sample cases:

Double click on the Select node on the left to edit it


Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button
Click the Select from existing field values button and insert the value 1_Training
Click OK

Figure 2.14 Completed Selection for the Training Partition

Click OK

Now we will edit the Select node on the right to select the Testing sample cases:

Double click on the Select node on the right to edit it

Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button
Click the Select from existing field values button and insert the value 2_Testing
Click OK and OK

Now, attach a Matrix node to each of the Select nodes. For each of the Select nodes:

Place a Matrix node from the Output palette below each Select node
Connect the Matrix nodes to the Select nodes
Double-click on each Matrix node to edit it
Put CHURNED in the Rows:
Put $N-CHURNED in the Columns:
Click the Appearance tab
Click the Percentage of row option


Click the Output tab and set Output name to Custom; for the Training sample, type Training, and for the Testing sample, type Testing (this will make it easier to keep track of which output we are looking at)

Click OK

For each actual churn category, the Percentage of row choice will display the percentage of records predicted into each of the outcome categories.

Figure 2.15 Updated Stream with new Select and Matrix Nodes

Execute each Matrix node


Figure 2.16 Matrix of Actual (Rows) and Predicted (Columns) Churned for Training and Testing Samples

For the training data, the model correctly predicts 78.5% of the current customers, 82.0% of the voluntary leavers, and 100% of the involuntary leavers. It is of course up to the researcher to decide whether these are acceptable levels of accuracy. The results for the Testing sample are very similar, which suggests that the model will work well with unseen data. When you decide whether to accept a model, and you report on its accuracy, you should use the results from the Testing (or Validation) sample. The model’s performance on the Training data may be too optimized for that particular sample, so its performance on the Testing sample will be the best indication of its performance in the future.

Close the Matrix windows

Evaluation Chart Node
The Evaluation Chart node offers an easy way to evaluate and compare predictive models in order to choose the best model for your application. Evaluation charts show how models perform in predicting particular outcomes. They work by sorting records based on the predicted value and confidence of the prediction, splitting the records into groups of equal size (quantiles), and then plotting the value of a criterion for each quantile, from highest to lowest. In addition, the Split by partition option in the node provides an easy and convenient way to validate the model by displaying not only the results of the model using the training data, but also, in a separate chart, how well it performed with the testing or holdout data. Of course, this assumes that you made use of the Partition node to develop the model. Otherwise, this option will be ignored.

Outcomes are handled by defining a specific value or range of values as a hit. Hits usually indicate success of some sort (such as a sale to a customer) or an event of interest (such as someone given credit being a good credit risk). Flag output fields are straightforward; by default, hits correspond to true values. For Set output fields, by default the first value in the set defines a hit. For the churn data, the first value for the CHURNED field is Current. To specify a different value as the hit value, use the Options tab of the Evaluation node to specify the target value in the User defined hit group.

There are five types of evaluation charts, each of which emphasizes a different evaluation criterion. Here we discuss Gains and Lift charts. For information about the others, which include Profit and ROI charts, see the Clementine 11.1 Node Reference. Gains are defined as the proportion of total hits that occurs in each quantile. We will examine the gains when the data are ordered from those most likely to those least likely to be in the Current category (based on the confidence of the model prediction).
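The gains calculation itself is simple to sketch. The function below (assuming pandas, a flag column marking hits and a column holding the model's confidence score; both column names are hypothetical) returns the cumulative percentage of hits captured in the top quantiles of the score-ordered records:

    import pandas as pd

    def cumulative_gains(df, hit_col="is_hit", score_col="confidence", quantiles=100):
        """Cumulative % of all hits found in the top 1..q score-ordered quantiles."""
        ordered = df.sort_values(score_col, ascending=False).reset_index(drop=True)
        ordered["quantile"] = pd.qcut(ordered.index, quantiles, labels=False) + 1
        hits_per_quantile = ordered.groupby("quantile")[hit_col].sum()
        return 100.0 * hits_per_quantile.cumsum() / ordered[hit_col].sum()

Under the base rate, hits are spread evenly across the quantiles, so the baseline value at the qth percentile is simply q; the model line is the function above.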

Place an Evaluation node from the Graphs palette near the generated Neural Net model named CHURNED

Connect the generated Neural Net model named CHURNED to the Evaluation node

Figure 2.17 Stream with Evaluation Node Connected to a Generated Model Node

Double-click the Evaluation node


Figure 2.18 Evaluation Node Dialog Box

The Chart type option supports five chart types, with Gains chart being the default. If the Profit or ROI chart type is selected, then the appropriate options (cost, revenue and record weight values) become active so information can be entered. The charts are cumulative by default (see the Cumulative plot check box), which is helpful in evaluating such business questions as “how will we do if we make the offer to the top X% of the prospects?”

The granularity of the chart (number of points plotted) is controlled by the Plot drop-down list, and the Percentiles choice will calculate 100 values (one for each percentile from 1 to 100). For small data files or business situations in which you can only contact customers in large blocks (say some number of groups, each representing 5% of customers, will be contacted through direct mail), the plot granularity might be decreased (to deciles (10 equal-sized groups) or vingtiles (20 equal-sized groups)).

A baseline is quite useful since it indicates what the business outcome value (here gains) would be if the model predicted at the chance level. The Include best line option will add a line corresponding to a perfect prediction model, representing the theoretically best possible result applied to the data, where hits = 100% of the cases.

Click the Include best line checkbox

The Split by partition option provides an opportunity to test the model against the unseen data that was held out by the Partition node. If checked, an evaluation chart will be displayed for both the Training and Testing samples. We will accept the default option to split by partition.

Click the Options tab


To change the definition of a hit, check the User defined hit check box and then enter the condition that defines a hit in the Condition box. For example, if we wanted the evaluation chart to be based on the Vol (voluntary leavers) category, the condition would be @TARGET = "Vol", where @TARGET represents the target fields from any models in the stream. The Expression Builder can be used to build the expression defining a hit. This tab also allows users to define how scores are calculated (User defined score), which determines how the records are ordered in Evaluation charts. Typically scores are based on functions involving the predicted value and confidence.

Figure 2.19 Evaluation Node Options Tab

The Include business rule option allows the Evaluation chart to be based only on records that conform to the business rule condition. So if you wanted to see how a model (or models) performs for single males, the business rule could be STATUS = "S" and SEX = "M". The model evaluation results used to produce the evaluation chart can also be exported to a file (Export results to file option).

Click Execute


Figure 2.20 Gains Chart (Cumulative) with Current as Target

The vertical axis of the gains chart is the cumulative percentage of the hits, while the horizontal axis represents the ordered (by model prediction and confidence) percentile groups. The diagonal line presents the base rate, that is, what we expect if the model is predicting the outcome at the chance level. The upper line (labeled Best) represents results if a perfect model were applied to the data, and the middle line (labeled $N-CHURNED) displays the model results. The three lines connect at the extreme [(0, 0) and (100, 100)] points. This is because if either no records or all records are considered, the percentage of hits for the base rate, best model, and actual model are identical.

The advantage of the model is reflected in the degree to which the model-based line exceeds the base-rate line for intermediate values in the plot, and the area for model improvement is the discrepancy between the model line and the Best (perfect model) line. If the model line is steep for early percentiles, relative to the base rate, then the hits tend to concentrate in those percentile groups of data. At the practical level, this would mean for our data that many of the current customers could be found within a small portion of the ordered sample. You can create bands on an Evaluation chart and generate a Select or Derive node for a band of business interest.

Click on Edit…Enable Interaction

The fact that the evaluation charts for both the Training and Testing data are so strikingly similar strongly suggests that the model will work well with unseen data. An examination of each of the evaluation charts reveals that across percentiles 1 through 40, the distance between the model and baseline lines grows (indicating a concentration of current customers). If we look at the 40th percentile value (horizontal axis) for the Training data, we see that under the base rate we expect to find 40% of the hits (current customers) in the first 40 percentiles (40%) of the sample, but the model produces over 70% of the hits in the first 40 percentiles of the model-ordered sample. This percentage can be displayed by placing your cursor on the line at the 40th percentile. The steeper the early part of the plot, the more successful the model is in predicting the target outcome. Notice that the line representing a perfect model (Best) continues with a steep increase between the 10th and 40th percentiles, while the results from the actual model flatten.

Figure 2.21 Hit Rate at the 40th Percentile

For the remainder of the percentiles (50 through 100), the distance between the model and base rate narrows, indicating that these last model-based percentile groups contain a relatively small (lower than the base rate) proportion of current customers. The Gains chart provides a way of visually evaluating how the model will do in predicting a specified outcome.

The lift chart is another way of representing this information graphically. It plots a ratio of the percentage of records in each quantile that are hits divided by the overall percentage of hits in the training data. Thus the relative advantage of the model is expressed as a ratio to the base rate.
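Building on the gains sketch shown earlier, cumulative lift is simply the gains percentage divided by the percentage of records examined (again an illustrative sketch, not Clementine's code):

    import pandas as pd

    def cumulative_lift(df, hit_col="is_hit", score_col="confidence", quantiles=100):
        """Cumulative hit rate in the top quantiles divided by the overall hit rate."""
        gains_pct = cumulative_gains(df, hit_col, score_col, quantiles)
        depth_pct = pd.Series(range(1, quantiles + 1), index=gains_pct.index) * (100.0 / quantiles)
        return gains_pct / depth_pct      # e.g. a value near 1.8 at the 20th percentile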

Close the Evaluation chart window
Double-click the Evaluation node named $N-CHURNED
Click the Plot tab
Click the Lift Chart Type option
Click Execute


The lift chart is shown in Figure 2.22. The lift value at the 20th percentile is 1.848 (recall this is cumulative lift), providing another measure of the relative advantage of the model over the base rate. Gains charts and lift charts are very helpful in marketing and direct mail applications, since they provide evaluations of how well the campaign would do if it were directed to the top X% of prospects, as scored by the model. Although the performance on the Testing data is slightly lower (1.647), it is still highly acceptable, adding further confidence to the model.

Figure 2.22 Lift Chart (Cumulative) with Current Customers as Target

We have established where the model is making correct and incorrect predictions and evaluated the model graphically. But how is the model making its predictions? In the next section we will examine a couple of methods that may help us to begin to understand the reasoning behind the predictions.

Close the Evaluation chart window

2.6 Understanding the Reasoning behind the Predictions
One method of trying to understand how a neural network is making its predictions is to apply an alternative machine learning technique, such as rule induction, to model the neural network predictions. We will introduce this approach in a later chapter. In the meantime we will use some other methods to understand the relationships between the predicted values and the fields used as inputs.


Symbolic Input with Symbolic Output
Based on the sensitivity analysis, a symbolic input of moderate importance is SEX. Since both it and the output field are symbolic, we can use a distribution plot with a symbolic overlay to understand how sex relates to the CHURNED predictions.

Place a Distribution node from the Graphs palette near the Select node for the Training sample
Connect the Select node to the Distribution node
Double-click the Distribution node
Select SEX from the Fields: list
Select $N-CHURNED as the Color Overlay field
Click the Normalize by color option
Click Execute

The Normalize by color option creates a bar chart with each bar the same length. This helps to compare the proportions in each overlay category.

Figure 2.23 Distribution Plot Relating Sex and Predicted Churned ($N-CHURNED)

The chart illustrates that the model is predicting that the majority of females are voluntary leavers, while the bulk of males were predicted to remain current customers.

Close the Distribution plot window

We next look at a histogram plot with an overlay.

Numeric Input with Symbolic Output
The most important numeric input for this model is International. Since the output field is symbolic, we will use a histogram of International with the predicted value as an overlay to try to understand how the network is associating international minutes used with CHURNED.

Place a Histogram node from the Graphs palette near the Select node for the Training sample
Connect the Select node to the Histogram node
Double-click the Histogram node
Click International in the Field: list
Select $N-CHURNED in the Overlay Color field list


Click on the Options tab
Click on Normalize by Color
Click Execute

We need to normalize by color because there are so few people in the file who did a substantial amount of international calling.

Figure 2.24 Histogram with Overlay of Predicted Churned ($N-CHURNED)

Here we see that the neural network is predicting that customers who spend a great deal of time on international calling are far more likely to voluntarily leave the company than persons who rarely do international calling. Now, let’s look at how Long Distance calling affected the predictions.

Close the Histogram plot window

Double-click the Histogram node
Click the Plots tab
Click LONGDIST in the Field: list
Click Execute


Figure 2.25 Histogram of LONGDIST with $N-CHURNED as the Overlay

Here the only clear pattern we see is that Involuntary Leavers tend to be people who do little or no long distance calling. In contrast, it appears that the amount of long distance calling was not as much of an issue when it came to predicting whether a person would remain a current customer or voluntarily choose to leave.

Close the Histogram plot window

Note: Use of Data Audit Node
We explored the relationship between just three input fields (International, LONGDIST, and SEX) and the prediction from the neural net ($N-CHURNED), and used Distribution and Histogram nodes to create the plots. If more inputs were to be viewed in this way, a better approach would be to use the Data Audit node, because overlay plots could easily be produced for multiple input fields, and a more detailed plot could be created by double-clicking on it in the Data Audit output window.

2.7 Saving the Stream
To save this stream for later work:

Click File…Save Stream As
Move to the c:\Train\ClemPredModel directory (if necessary)
Type NeuralNet in the File Name: text box
Click Save


2.8 Model Summary
In summary, we appear to have built a neural network that is reasonably good at predicting the three different CHURNED groups. The overall accuracy was about 82% with the Training data, and 78% with the Testing data. Focusing on the Testing or unseen data, the model is most accurate at predicting the Involuntary Leaver group (100%), but somewhat less successful predicting the Current Customers and Voluntary Leavers, each at about 76%. Considering that the model was correct three-quarters of the time even in the case of these latter two groups, it is certainly within the realm of possibility that the model may be considered a success. Of course, this would depend on whether these accuracy rates met or exceeded the minimum requirements defined at the beginning of the data mining project.

In terms of how the predictors relate to the model, the most important factors in making its predictions are International, Sex, and Longdist. The network appears to associate females with the Voluntary Leaver group and predicts that males will remain Current Customers. The model also tends to predict that the people who are most likely to be dropped by the company (Involuntary Leavers) are those who do little or no long distance calling.

2.9 Advanced Neural Network Techniques
One of the advantages of Clementine is the ease with which you are able to build a neural network without in fact knowing too much about how the algorithms work. Earlier in this chapter we used the default settings and the “Quick” method. The resulting network predicts with a reasonable degree of accuracy and can be used on new data quite easily. However, a more advanced user often needs to be able to further improve the performance of a trained network. In this section, we use some of the more advanced techniques and expert options within Clementine to improve the models that you build.

We will begin by briefly describing the two different types of training methods: the Multi-Layer Perceptron (MLP) and the Radial Basis Function Network (RBFN). The Neural Net node contains four different MLP algorithms and one RBFN. The expert options for each of these methods will be described, including various topologies and learning rates. Finally we will discuss the use of sensitivity analysis and provide information on how to prevent over-training.

2.10 Training Methods
As we said earlier, a neural network consists of a number of processing elements, often referred to as “neurons,” that are arranged in layers. Each neuron is linked to every neuron in the previous layer by connections that have strengths or weights attached to them. The learning algorithm controls the adaptation of these weights to the data; this gives the system the capability to learn by example and generalize for new situations.

The main consideration when building a network is to locate the best, or global, solution within a domain; however, the domain may contain a number of sub-optimal solutions. The global solution can be thought of as the model that produces the least possible error when records are passed through it. To understand the concept of global error, imagine a graph created by plotting the hidden weights within the neural network against the error produced. Figure 2.26 gives a simple representation of such a graph. With any complex problem there may be a large number of feasible solutions, thus the graph contains a number of sub-optimal solutions or local minima (the “valleys” in the plot). The trick to training a successful network is to locate the overall minimum or global solution (the lowest point), and not to get “stuck” in one of the local minima or sub-optimal solutions.

Figure 2.26 Representation of the Error Domain Showing Local and Global Minima

There are many different types of supervised neural networks (that is, neural networks that require both inputs and an output field). However, within the world of data mining, two are most frequently used. These are the Multi-Layer Perceptron (MLP) and the Radial Basis Function Network (RBFN). In the following paragraphs we will describe the main differences between these types of networks and describe their advantages and disadvantages.

2.11 The Multi-Layer Perceptron
The Multi-Layer Perceptron is currently the most popular type of neural network. The MLP network consists of layers of neurons, with each neuron linked to all neurons in the previous layer by connections of varying weights. All MLP networks consist of an input layer, an output layer and at least one hidden layer. The hidden layer is required to perform non-linear mappings. The number of neurons within the system is directly related to the complexity of the problem, and although a multi-layered topology is feasible, in practice there is rarely a need to have more than one hidden layer.

To visualize how a MLP works, imagine a problem where you wish to predict an outcome field consisting of two groups, using only two input fields. Figure 2.27 shows a graph of the two input fields plotted against one another, overlaid with the output field. Using a non-linear combination of the inputs, the MLP fits an open curve between the two classes.


Figure 2.27 Decision Surface Created Using the Multi-Layer Perceptron

The advantages of using a MLP are:

• It is effective on a wide range of problems
• It is capable of generalizing well
• If the data are not clustered in terms of their input fields, it will classify examples in the extreme regions
• It is currently the most commonly used type of network and there is much literature discussing its applications

The disadvantages of using a MLP are:

• It can take a great deal of time to train
• It does not guarantee finding the best global solution

Within Clementine, there are four available MLP learning algorithms: Quick, Dynamic, Multiple and Prune. In addition, an Exhaustive Prune option (Prune with a large preset topology) is available. Choosing an appropriate algorithm can involve a trade-off between computing time and accuracy (increased computing time supports a more extensive search for the global solution). We will discuss the four different algorithms and their settings in later paragraphs.

2.12 The Radial Basis Function
The Radial Basis Function (RBF) is a more recent type of network and is responsive to local regions within the space defined by the input fields. Figure 2.28 shows a graphical representation of how a RBF fits a number of basis functions to the problem described in the previous section. The RBF can be thought of as performing a type of clustering within the input space, encircling individual clusters of data by a number of basis functions. If a data point falls within the region of activation of a particular basis function, then the neuron corresponding to that basis function responds most strongly. The concept of the RBF is extremely simple; however, the selection of the centers of each basis function is where difficulties arise.


Figure 2.28 Operation of a Radial Basis Function

The advantages of using a RBF network are:

• It is quicker to train than a MLP
• It can model data that are clustered within the input space

The disadvantages of using a RBF network are:

• It is difficult to determine the optimal position of the function centers
• The resulting network often has a poor ability to represent the global properties of the data

Within Clementine there is only one available RBF algorithm, which uses the k-means clustering algorithm to determine the number and location of the centers in the input space. We discuss the settings of this algorithm in greater detail in the following sections.

2.13 Expert Options
In the following sections we consider the various neural net algorithms, together with their specific expert options available within the Neural Net node. First, however, we introduce a number of parameters that are common to most of the algorithms, specifically alpha, eta and persistence. It should be noted that using these parameters requires some knowledge of neural networks.

Alpha
Alpha refers to the momentum used in updating the weights when trying to locate the global solution. It tends to keep the weight changes moving in a consistent direction and can reduce the training time. Each update includes a factor of alpha times the previous update. Alpha ranges between 0 and 1, and its default value is 0.9.
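The update rule this describes is the standard momentum term used with back-propagation; a minimal sketch (variable names are illustrative, not Clementine's internals):

    # Weight update with momentum: each step adds alpha times the previous step,
    # which keeps the changes moving in a consistent direction.
    alpha = 0.9          # momentum (the default)
    eta = 0.3            # learning rate (see Eta below)

    def update_weight(weight, gradient, previous_delta):
        delta = -eta * gradient + alpha * previous_delta
        return weight + delta, delta      # the delta is kept for the next cycle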


Eta
Eta refers to the learning rate and can be thought of as how much of an adjustment can be made at each update. Within the expert options, the Initial Eta is the starting value of eta. This is then exponentially decayed to Low Eta, at which point it is set back to High Eta and decayed back to Low Eta. The decay takes Eta Decay cycles to go from High Eta to Low Eta. Figure 2.29 illustrates the exponential decay of eta.

Figure 2.29 Exponential Decay of Eta
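A plausible form of this schedule is sketched below (Clementine's exact decay formula and default values may differ; the numbers here are illustrative):

    # Eta decays exponentially down to Low Eta over 'eta_decay' cycles, then resets to
    # High Eta and decays again; the very first decay period starts at Initial Eta.
    initial_eta, high_eta, low_eta, eta_decay = 0.3, 0.1, 0.01, 30

    def eta_at(cycle):
        start = initial_eta if cycle < eta_decay else high_eta
        progress = (cycle % eta_decay) / eta_decay        # 0 at the start of each period
        return start * (low_eta / start) ** progress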

When deciding upon settings of both alpha and eta it is often useful to think of a simple analogy. Imagine a ball bearing (the current network) rolling down a hill, trying to find the lowest point (global solution). Alpha is analogous to the size of the ball bearing and eta is the gradient of the hill. When choosing a network, we want to reach the overall minimum, not a sub-optimal solution; hence, we require a ball bearing that has enough momentum to move in and out of local minima (alpha) and a space that contains local hills with small gradients. Figures 2.30 and 2.31 illustrate this analogy.

Figure 2.30 Network Becomes Stuck in a Local Minimum and Provides a Sub-Optimal Solution


Figure 2.31 Network Can Move In and Out of Local Minima and Locate the Globally Optimal Solution

Persistence
Persistence refers to the number of cycles for which the network will train without improvement in default stopping mode. Neural networks must pass the data through their neural topology many, many times before finding a stable and viable solution. When deciding at what values to set Alpha, Eta and Persistence, the feedback graph can be used to help make your decisions. Figure 2.32 illustrates this. The first feedback graph shows an ideal feedback graph. Recommended values are typically:

• alpha = 0.9
• eta = 0.01 to 0.3
• persistence = 200
• eta decay cycles = 30

If a solution is reached extremely quickly and the graph then reaches a plateau, as shown in the second graph in Figure 2.32, a reduction in the Initial Eta can help. If the accuracy increases and then suddenly drops without reaching a plateau, as illustrated in the third graph in Figure 2.32, the Eta Decay cycles should be increased. Finally, if the training terminates too quickly, as shown in the last graph in Figure 2.32, the Persistence should be increased.


Figure 2.32 Various Feedback Graphs and the Possible Solution

We will now go on to discuss each of the algorithms available within Clementine and their expert options.

2.14 Available Algorithms
The Neural Net node contains four MLP algorithms (Quick, Dynamic, Multiple, Prune) and one RBF algorithm, RBFN. In the following paragraphs we will briefly describe the five different algorithms available within the Neural Net node and the set of expert options specific to each training method.

Quick
The Quick method (the default) creates a network that contains one hidden layer. The number of neurons within the hidden layer is determined according to several factors relating to the number and type of fields used in the analysis. If it is not already open, open the NeuralNet.str stream.

Click File…Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on NeuralNet.str


Figure 2.33 NeuralNet.str Clementine Stream

Edit the Neural Net node (named CHURNED) in the upper stream
Click Quick on the Method drop-down list (if necessary)
Click the Expert tab
Click the Expert Mode option button


Figure 2.34 Expert Options for the Neural Net Node’s Quick Training Method

You can select whether to use up to three Hidden Layers and set the number of neurons within each of these layers. The default Persistence is set to 200 and can be altered if required. The momentum (Alpha) and learning rate (Eta) can also be changed.

Run the Neural Net model with the Quick method.

Click Execute

We will use this model later in the chapter.

Dynamic
Dynamic training uses a “dynamically growing” network that starts with two hidden layers of two neurons each and begins to grow by adding one neuron to each layer. The network monitors the training and looks for over-training; lack of improvement triggers growth of the network. The network will continue to grow until adding a neuron gives no benefit for a number of growing attempts. This option is slow but often yields good results. The Dynamic method has no Expert options.

Multiple
The Multiple method creates a number of networks, each with a different topology. Some contain one hidden layer; others contain two, all with varying numbers of hidden neurons. The networks are trained in pseudo-parallel, which causes this method to be extremely slow to train, but it can yield good results.


Click the Model tab and click Multiple in the Method drop-down list
Click the Expert tab

Figure 2.35 Expert Options for the Neural Net Node’s Multiple Training Method

The Topologies box is a text field in which you may specify a set of topologies for the networks to be trained. The topologies refer only to the hidden layers (because the input and output layers are fixed by the fields in the model). Semicolons separate the network definitions: Network1; Network2; Network3. Commas separate the hidden layer structure: Layer1, Layer2, Layer3. Spaces separate up to three numbers that represent the number of neurons in a layer: n m inc, where:

• n alone represents the number of neurons in the hidden layer
• n m represents alternative sizes for the hidden layer, one for each integer between n and m inclusive
• n m inc represents alternative sizes for the hidden layer, one for each integer between n and m in jumps of inc


For example, the topologies setting (2 20 3; 2 27 5, 2 22 4) represents a set of networks with one hidden layer containing 2, 5, 8, 11, 14, 17, and 20 neurons, and then a set of networks with two hidden layers using all combinations of 2, 7, 12, 17, 22, or 27 neurons in the first layer and 2, 6, 10, 14, 18, or 22 neurons in the second hidden layer. The Discard Non-Pyramids option, when checked, ensures that no non-pyramid networks are produced. A network is a pyramid if each layer contains a smaller number of neurons than the preceding layer. Non-pyramids have been found not to train as well as pyramids. The default Persistence is set to 200 and can be altered if required. The momentum (Alpha) and learning rates (Initial Eta, High Eta, and Low Eta) can also be changed.
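To make the Topologies syntax concrete, here is a small sketch that expands such a string into explicit lists of hidden-layer sizes. It is purely illustrative of the notation described above; the function name and parsing approach are our own, not the node's internal code.

```python
from itertools import product

def expand_topologies(spec):
    """Expand a Topologies string such as "2 20 3; 2 27 5, 2 22 4"
    into explicit hidden-layer size combinations."""
    def expand_layer(layer_spec):
        numbers = [int(tok) for tok in layer_spec.split()]
        if len(numbers) == 1:               # n: a single size
            return [numbers[0]]
        if len(numbers) == 2:               # n m: every integer from n to m inclusive
            n, m = numbers
            return list(range(n, m + 1))
        n, m, inc = numbers                 # n m inc: n to m in jumps of inc
        return list(range(n, m + 1, inc))

    networks = []
    for network_spec in spec.split(';'):    # semicolons separate network definitions
        layer_choices = [expand_layer(layer) for layer in network_spec.split(',')]
        networks.extend(list(combo) for combo in product(*layer_choices))
    return networks

# expand_topologies("2 20 3; 2 27 5, 2 22 4")
# -> [[2], [5], ..., [20], [2, 2], [2, 6], ..., [27, 22]]
```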

Prune
The Prune method begins with a large one or two hidden-layer network (large meaning all predictors and many hidden neurons). Training is initially done in the same manner as Quick. A sensitivity analysis is then performed on the hidden neurons, and the weakest (the proportion based on the Hidden Rate factor—default .15 or 15%) hidden neurons are removed from the network. This process of training and removing is repeated until there has been no improvement in the network (for a Hidden Persistence number of loops). Once this has been done, training is performed, a sensitivity analysis performed on the input neurons, and the least important input neurons (proportion based on the Input Rate—default .15 or 15%) are removed. This loop continues until the network no longer improves (for an Input Persistence number of loops). The larger loop of pruning the hidden layer, then pruning the input layer, continues until there has been no improvement (for an Overall Persistence number of loops). The final network may actually use fewer predictor fields than originally supplied. The Prune method is generally the best of the four methods, but is very time consuming.

Click the Model tab and click Prune in the Method drop-down list
Click the Expert tab

Figure 2.36 Expert Options for the Neural Net Node’s Prune Training Method

You may specify the size and number of Hidden Layers within the starting network (usually set slightly larger than expected). The Hidden rate is the factor by which the number of neurons within the hidden layers is reduced by each pruning. The Input rate is the factor by which the number of neurons within the input layer is reduced by each pruning. The Hidden persistence and Input persistence are the number of prunes to perform to the hidden and input layer, respectively, without improvement. The Persistence is the number of cycles for which the network will train without improvement, before it attempts to prune one of the layers. The Overall persistence is the number of times the network will pass through the prune hidden/prune input loop without improvement in Default stopping mode. As with the other algorithms, you can alter the persistence, momentum (alpha) and the learning rates (eta).

Exhaustive Prune
The Exhaustive Prune Training Method invokes Prune settings that are designed to produce a more exhaustive examination of networks than is done by the default Prune method. When Exhaustive Prune is chosen, the Hidden rate and Input rate are reduced (from .15) to .02 and .01, respectively. This means that fewer neurons and fields are removed at each stage, so a larger number of topologies can be examined. The Persistence values are also increased (Hidden persistence and Input Persistence are 40 and Persistence is 1000), so each topology can receive more training. In addition, the initial topology is a large two-hidden layer network. The learning rates are not changed.


Thus, the Exhaustive Prune method provides an easy way to request a more complete examination of networks within the Prune method. Since the Exhaustive Prune option invokes the Prune method with specific Expert settings, no Expert options are available when Exhaustive Prune is selected as the method. If you wish to run an Exhaustive Prune model and change the learning rates, you can use the Expert options under the Prune method to manually reduce the Hidden rate and Input rate, increase the Persistence values, and then change the learning rates. The Exhaustive Prune option will be slower than the Prune method, but since it examines more networks, it may produce a better model.

RBFN
The RBFN method creates a Radial Basis Function Network and works by initially creating a K-means clustering model that provides the “centers” for the hidden layer. The output layer is then trained as a “Single Layer Perceptron” using the Least Mean Squares (LMS) method. When the Expert mode is selected, the options shown below become available.

Click the Model tab and click RBFN in the Method drop-down list
Click the Expert tab

Figure 2.37 Expert Options for the Neural Net Node’s RBFN Training Method

You may specify the size of the “hidden layer” by changing the number of RBF clusters. The Persistence, Alpha and Eta can be altered if required, although the eta parameter can be calculated automatically on the basis of learning performance in the first two iterations.


During RBF training, a data record (or pattern) is most strongly represented in the cluster(s) nearest it. In other words, its activation value is highest for nearby clusters. When the distance between a record and a cluster center is evaluated, it is divided by a width or spread parameter (named sigma), which plays a role equivalent to that of the standard deviation in a normal distribution (the normal distribution is close in form to a commonly used activation function). In Clementine, the sigma (or spread) value for a cluster is equal to the average distance from that cluster to the two nearest clusters. These sigma values are determined by Clementine, but you can effectively increase or decrease them by modifying the RBF overlapping factor. By increasing this multiplicative factor above 1, records further from a cluster center will be reflected in that cluster during training, and so clusters will tend to overlap (a record can contribute to multiple clusters). As the RBF overlapping factor is decreased (from 1 toward 0), a cluster tends to be represented in training by only those records tightly grouped around it. Thus low RBF overlapping factors lead to tight, distinct clusters, while large RBF overlapping factors lead to large, dispersed, overlapping clusters being used in training the network.
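The interplay of sigma and the RBF overlapping factor can be illustrated numerically. The sketch below is not the node's internal code: the Gaussian activation form and the example centers are assumptions chosen to be consistent with the description above.

```python
import numpy as np

def rbf_activations(record, centers, overlap_factor=1.0):
    """Gaussian RBF activations of one record for a set of cluster centers.

    sigma for each cluster = average distance to its two nearest neighbouring
    clusters, multiplied by the overlap factor (>1 widens clusters, <1 tightens them).
    """
    centers = np.asarray(centers, dtype=float)
    record = np.asarray(record, dtype=float)
    # pairwise distances between cluster centers
    diffs = centers[:, None, :] - centers[None, :, :]
    center_dists = np.sqrt((diffs ** 2).sum(axis=-1))
    sigmas = np.empty(len(centers))
    for i in range(len(centers)):
        nearest = np.sort(center_dists[i][center_dists[i] > 0])[:2]
        sigmas[i] = overlap_factor * nearest.mean()
    dists = np.sqrt(((centers - record) ** 2).sum(axis=1))
    return np.exp(-(dists / sigmas) ** 2)

# Example: a record near the first of three (made-up) centers.
# rbf_activations([0.1, 0.0], centers=[[0, 0], [1, 0], [0, 1]], overlap_factor=1.5)
```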

2.15 Which Method, When?
Due to the random nature of neural networks, the models built using each of the different algorithms will tend to perform with varying degrees of accuracy. When building neural networks it is sensible to try a number of models and either choose the one with the best overall performance or use all models to gain a majority prediction. Deciding which algorithm to use is often a personal choice; however, there are a number of guidelines:

• If time is limited, use the default Quick method, which will give a rough indication of the results in a short amount of time.

• If accuracy is the main concern and time is unlimited, try a network built using the Prune algorithm, perhaps using the Exhaustive Prune option.

• If you think a number of the input fields may not be necessary, the Prune / Exhaustive Prune networks will remove weaker neurons and hence discard unnecessary inputs.

• If you are skeptical about finding a global solution, try the RBFN algorithm, as this approach is guaranteed to produce a function that fits all the data points and has no local minima.

• The Multiple method is rarely used.

2.16 Sensitivity Analysis
Once the network has been built, if the Sensitivity Analysis option has been checked, the resulting network is examined and the sensitivity of each input is calculated. Sensitivity is measured, for each input field, as a score between 0.0 and 1.0. It is calculated by analyzing the effect of changing a particular input while holding the other inputs constant. The greater the effect on the output of changing an input, the more important it is. As a rough rule of thumb, a sensitivity score of 0.0 indicates that the field is unimportant and a score over 0.3 indicates substantial importance of the field. The sensitivity analysis may be viewed by browsing the generated model from the Models manager or editing it in the Stream Canvas. These results can be used to better understand how the network makes predictions, to select the top inputs and use those to train a new network (although this is done automatically by selecting either the Prune or Exhaustive Prune methods), or to check that the ranking of the input fields agrees with their positions in decision trees (if alternative models have been calculated).
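The idea behind the sensitivity calculation can be sketched as follows. This is a rough illustration only, not the exact formula Clementine uses; `predict` stands in for any trained model's scoring function, and the scaling to a 0–1-like score is an assumption.

```python
import numpy as np

def input_sensitivity(predict, X, field_index, n_points=10):
    """Rough sensitivity of `predict` to one input field (illustrative only).

    Varies the chosen field across its observed range while holding the other
    inputs at their observed values, and measures how much the predictions move
    relative to the spread of the baseline predictions."""
    X = np.asarray(X, dtype=float)
    baseline = np.asarray(predict(X), dtype=float)
    spread = float(baseline.max() - baseline.min()) or 1.0
    lo, hi = X[:, field_index].min(), X[:, field_index].max()
    effects = []
    for value in np.linspace(lo, hi, n_points):
        X_mod = X.copy()
        X_mod[:, field_index] = value          # perturb one field, hold the rest constant
        effects.append(np.abs(np.asarray(predict(X_mod)) - baseline).mean())
    return float(np.mean(effects) / spread)    # larger values = more influential input
```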


2.17 Prevention of Over-Training
One final consideration when building a neural network is whether the resulting model is sufficiently general when making its predictions. If a model is over-trained, it will eventually “learn” the patterns within the training data set and the error attached to the model will approach zero. At this stage, given that data usually contain noise (error), the neural net model will have learned the noise characteristics, which will degrade the overall performance of the model on unseen data. In order to avoid the problem of over-training, a test data set should be used to monitor the training process. By checking the Prevent overtraining option within the Neural Net node, the data are split into two random subsets, known as “train” and “test” sets, whose sizes are determined by the Sample % value. While training, the train data are used to build a network and the test data are then used to measure the error of the model on different, unseen data. This cycle, of passing the test data through the model built on the train data, is repeated many times, and the model with the lowest overall error measured on the test data set is deemed the “best network.” At the end of training, it is this “best network” that appears in the Models manager. Note that practitioners should use a separate validation sample, not used at all during training or testing, in order to estimate the model error.

2.18 Missing Values in Neural Networks
The Neural Net node requires valid values for all input fields. If there are missing values for any inputs, these values are converted to valid values using the rules listed in the table below. This is similar to what is done by the last Type node or Types tab, when a field’s Check value is set to Coerce. However, the conversion rules differ for flag and set fields due to the way a neural network processes fields with these types. In the table below, the Illegal/Missing Value column includes more than just null values, white space (strings with no characters), and blanks (user-defined missing values). For example, if a set value not stored in the Values column of the last Type node (or Types tab of a source node) is encountered, it is considered illegal. Similarly, a value above the upper bound or below the lower bound of a range field would be considered illegal.

Table 2.2 How the Neural Net Node Converts Missing and Illegal Values

Field Type | Illegal/Missing Value                | Converted to
Flag       | A missing value or any unknown value | .5 (midway between 0 and 1; a flag field is represented by the values 0 and 1 within the Neural Net node)
Set        | A missing value or any unknown value | 0 (for all the 0,1 input fields that the Neural Net node creates to represent an input field of type set)
Range      | A value greater than the upper bound | Replaced by the upper bound
Range      | A value less than the lower bound    | Replaced by the lower bound
Range      | A non-numeric or missing value       | Midpoint value of the range
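If you want to reproduce this behaviour yourself before modeling (for example, to take control of the substitutions with Filler or Derive nodes, or outside Clementine), the rules amount to something like the following sketch. The function, its arguments, and the field metadata names are hypothetical, and the handling is an approximation of the table above rather than the node's exact logic.

```python
def convert_for_neural_net(value, field_type, valid_values=None, bounds=None):
    """Approximate substitution rules similar to Table 2.2 (illustrative sketch only)."""
    if field_type == "flag":                        # flags are coded 0/1 internally
        return value if value in (0, 1) else 0.5    # unknown/missing -> 0.5
    if field_type == "set":                         # each category becomes a 0/1 input
        return value if value in (valid_values or []) else 0
    if field_type == "range":
        lower, upper = bounds                       # bounds must be supplied for ranges
        if not isinstance(value, (int, float)):     # non-numeric or missing
            return (lower + upper) / 2.0            # midpoint of the range
        if value > upper:
            return upper
        if value < lower:
            return lower
        return value
    return value
```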

The Neural Net node handles these value conversions automatically. It is therefore important to check the data in advance of running a Neural Net using the Quality and Data Audit nodes and to decide which records and fields should be passed to the neural network for modeling. Otherwise you run the risk of a model being built using data values supplied by these substitution rules. Also, you can take control of the substitution by using the facilities of Clementine to change missing values to valid values that you prefer, before using the Neural Net node.

Note: Accuracy for a Continuous Output Neuron
For a nominal output variable, the accuracy is simply the percentage correct. It is worth noting that if the output field is continuous, then accuracy within Clementine is defined as the average across all records of the following expression:

100 * [1 – (abs(Target – Network Prediction) / (Range of Target values))]
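This expression is easy to compute directly. A minimal sketch, assuming numeric arrays of actual and predicted values:

```python
import numpy as np

def continuous_accuracy(target, prediction):
    """Average of 100 * [1 - |target - prediction| / range(target)] across records."""
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    target_range = target.max() - target.min()
    return float(np.mean(100.0 * (1.0 - np.abs(target - prediction) / target_range)))

# continuous_accuracy([10, 20, 30, 40], [12, 19, 33, 37])  # -> 92.5
```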

2.19 Exploring the Different Neural Network Options
While the Quick method usually produces a reasonably accurate result, it is wise to try some of the other methods to see if they produce an even more accurate model. In addition, we may also need to adjust the learning rate, momentum, or persistence values if we suspect that we have reached a sub-optimal solution. As was shown in Figure 2.32, the feedback graph may help us decide whether any of these parameters need to be changed. Unfortunately, because the data set we are using is relatively small, we were unable to view the feedback graph when we created the model with the Quick method. For larger data sets, this shouldn’t be a problem. To illustrate how the feedback graph can be used to refine the model, we will experiment with the Dynamic and Prune methods, both of which take considerably longer to run. We will save the other methods for you to run later during the hands-on exercises. At the same time, we will see whether either of these two other methods produces a more accurate model. Note that you will need to rename each model in the Models palette, because otherwise it will be overwritten. We will also use the same seed value throughout these analyses in order to facilitate comparisons. First we will use the Dynamic method.

Right click on the generated model in the Models manager and select Rename and Annotate from the context menu

Click Custom and type Quick
Edit the Neural Net node (named CHURNED)
Click the Model tab and select Dynamic from the Method drop-down list
Click Execute

The rapid rise in the line seems to suggest that the model reached a solution too quickly (see Figure 2.32, second graph). Unfortunately, because the Dynamic method has no expert options, it won’t be possible to reduce the Initial Eta, which is recommended in Figure 2.32. However, this alone doesn’t mean that the Dynamic method won’t be the best model with this data. As we will see, there are other factors to consider, such as overall accuracy, or the ability of the model to correctly predict a particular group of interest.

Figure 2.38 Feedback Graph Using the Dynamic Method

Now, let’s try running the Prune method. It should be noted that this method usually takes considerably longer than the other methods but is often more accurate.

Right click on the new generated model in the Models manager and select Rename and Annotate from the context menu

Click Custom and type Dynamic
Edit the Neural Net node (named CHURNED)
Click the Model tab and select Prune from the Method drop-down list
Click Execute

It appears that the Prune feedback graph also suggests a sub-optimal solution. While normally you would let the model run to completion, in the interest of time we will stop the execution and adjust the Initial Eta value. (In case you are interested, on my computer the model took 8 minutes and 9 seconds to complete, and the Estimated Accuracy was 85.795%. Of course, these timings will differ from computer to computer.)

Figure 2.39 Feedback Graph Using the Prune Method

Click the Stop Execution button on the tool bar (it turns red during stream execution)
Click Yes when you are asked whether you want to stop the execution
Click No when you are asked whether you want to try generating a model
Close the Message screen

Now let’s rerun the model with a lower Initial Eta value.

Edit the Neural Net node (named CHURNED)
Click the Expert tab
Click Expert Mode
Change the Initial Eta value to 0.2

Figure 2.40 Expert Prune Options After Reducing the Initial Eta value

Click Execute
At a quick glance, it appears that the pattern in the feedback graph is no different from when the Initial Eta was 0.3. We could continue decreasing the value, but we will leave that to you as an exercise. Instead, let’s examine the results of the model.

Right click on the new generated model in the Models manager and select Rename and Annotate from the context menu

Click Custom and type Prune
Right-click again on the model and click Browse
Expand the Relative Importance of Inputs folder

Figure 2.41 Relative Importance of the Fields Using the Prune Method

In this instance, reducing the Initial Eta resulted in a slightly less accurate model than with the defaults (85.511% vs. 85.795%). One other thing to notice is that the Prune method used fewer predictors than the Quick method. However, both models agree that Longdist, International, and Sex were the best predictors. By the way, although we did not display the sensitivity table for the Dynamic method, it also ranked these three fields as the top predictors. Now that we have three separate models, we can compare them to see which one is the best for our purposes. It should be noted that overall accuracy isn’t the only criterion with which to compare them. For example, if we were primarily interested in identifying Voluntary Leavers, we would want to use the model that did the best job of predicting them. We will use an Analysis node to make our comparisons.

Move all the generated models to the Stream Canvas
Connect them in sequence (Quick, Dynamic, Prune) to the Type node (as shown in Figure 2.42)
Attach an Analysis node from the Output palette to the last model

Figure 2.42 Stream for Comparing the Three Models

Edit the Analysis node
Check Coincidence matrices (for symbolic targets)
Click Execute

The output is shown in Figure 2.43. The results for each model are listed according to their order in the stream. Thus, N-CHURNED corresponds with the Quick method, N1-CHURNED with the Dynamic method and N2-CHURNED with the Prune method. We will focus on the Testing data, or the data that was not used to build the models. This will give us a good idea of how well each model will work with new data. The model with the best overall accuracy was created by the Prune method (81.4%), while the Quick (78.4%) and Dynamic (74.4%) methods did somewhat worse. However, as was mentioned earlier, it isn’t always the overall accuracy that is of the most interest. For example, if the primary focus was on identifying Current Customers or Voluntary Leavers, the best model for that purpose would not necessarily be the one with the highest overall accuracy, although based on the coincidence matrices, in this case it was.

Figure 2.43 Results of the Model Comparisons Using the Analysis Node

Summary Exercises

A Note Concerning Data Files
In this training guide, files are assumed to be located in the c:\Train\ClemPredModel directory.

Exercise Data Information Sheets
The exercises in this chapter are written around the data file charity.sav. The following section gives details of the file. charity.sav comes from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity, and basic demographics such as age, gender, and mosaic (demographic) group. The file contains the following fields:

response   Response to campaign
orispend   Pre-campaign expenditure
orivisit   Pre-campaign visits
spendb     Pre-campaign spend category
visitb     Pre-campaign visits category
promspd    Post-campaign expenditure
promvis    Post-campaign visits
promspdb   Post-campaign spend category
promvisb   Post-campaign visit category
totvisit   Total number of visits
totspend   Total spend
forpcode   Post Code
mos        52 Mosaic Groups
mosgroup   Mosaic Bands
title      Title
sex        Gender
yob        Year of Birth
age        Age
ageband    Age Category

In this session we will attempt to predict the field Response to campaign using a neural network.

1. Begin with a clear Stream canvas. Place an SPSS source node on the Stream canvas and connect it to the file charity.sav. Tell Clementine to use variable and value labels.

2. Attach a Type and Table node in a stream to the source node. Execute the stream and allow Clementine to automatically define the types of the fields.

3. Edit the Type node. Set all of the fields to direction NONE.

4. We will attempt to predict response to campaign using the fields listed below. Set the direction of all five of these fields to IN and the Response to campaign field to OUT.


Pre-campaign expenditure
Pre-campaign visits
Gender
Age
Mosaic Bands (which should be changed to type Set)

5. Attach a Neural Net node to the Type node. Execute the Neural Net node with the default settings.

6. Once the model has finished training, browse the generated Net node within the Generated Models palette in the Manager. What is the predicted accuracy of the neural network? What were the most important fields within the network?

7. Place the generated Net node on the Stream canvas and connect the Type node to it. Connect the generated Net node to a Matrix node and create a data matrix of actual response against predicted response. Which group is the model predicting well?

8. Use some of the methods introduced in the chapter, such as web plots and histograms (or use the Data Audit node with an overlay field), to try to understand the reasoning behind the network’s predictions.

9. Try some of the other neural net algorithms to see if you can improve on the accuracy. Also try modifying some of the parameters on the Expert tab to see if you can get a better result for the current model. Note that, in the interest of time, you may have to stop execution on some of the methods other than Quick. However, you can still generate a sensitivity table even after you stop the execution.

Save a copy of the stream as Network.str.


Chapter 3: Decision Trees/Rule Induction

Overview

• Introduce the features of the C5.0, CHAID, C&R Tree and QUEST nodes
• Review the Interactive Trees feature
• Understand how CHAID and C&R Tree model a numeric output

Objectives
We will introduce the differences between the four types of decision tree/rule induction algorithms and detail the expert options available within their respective modeling nodes and when to use them. We will demonstrate the Interactive Trees feature with CHAID. We briefly explain how CHAID and C&R Tree can also be used to model a numeric output.

Data
We will use the data set churn.txt, which we used in the Neural Networks chapter. This data file contains information on 1477 of the company’s customers who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers and voluntary leavers. In this chapter, we will use decision tree models to understand which factors influence the group into which an individual falls. The file contains information including length of time spent on local, long distance and international calls, the type of billing scheme and a variety of basic demographics, such as age and gender of the customer. A second data set, Insclaim.dat, used with the C&R Tree node, contains 293 records based on patient admissions to a hospital. All patients belong to a single diagnosis related group (DRG). Four fields (grouped severity of illness, age, length of stay, and insurance claim amount) are included. The goal is to build a predictive model for the insurance claim amount and use this model to identify outliers (patients with claim values far from what the model predicts), which might be instances of errors made in the claims. Such analyses can be performed for error or fraud detection in instances where audited data (for which the outcome, error/no error or fraud/no fraud, is known) are not available.

3.1 Introduction
Clementine contains four different algorithms for constructing a decision tree (more generally referred to as rule induction): C5.0, CHAID, QUEST, and C&R Tree (classification and regression trees). They are similar in that they can all construct a decision tree by recursively splitting data into subgroups defined by the predictor fields as they relate to the outcome. However, they differ in several important ways. We begin by reviewing a table that highlights some distinguishing features of the algorithms. Next, we will examine the various options for the algorithms in the context of predicting a symbolic output. Within each section we discuss when it is viable to use the expert options within these nodes.


3.2 Comparison of Decision Tree Models
The table below lists some of the important differences between the decision tree/rule induction algorithms available within Clementine.

Table 3.1 Some Key Differences Between the Four Decision Tree Models

Model Criterion                             | C5.0                           | CHAID                             | QUEST                   | C&R Tree
Split Type for Symbolic Predictors          | Multiple                       | Multiple¹                         | Binary                  | Binary
Continuous Target                           | No                             | Yes                               | No                      | Yes
Continuous Predictors                       | Yes                            | No²                               | Yes                     | Yes
Criterion for Predictor Selection           | Information measure            | Chi-square (F test for continuous) | Statistical            | Impurity (dispersion) measure
Can Cases Missing Predictor Values be Used? | Yes, uses fractionalization    | Yes, missing becomes a category   | Yes, uses surrogates    | Yes, uses surrogates
Priors                                      | No                             | No                                | Yes                     | Yes
Pruning Criterion                           | Upper limit on predicted error | Stops rather than overfit         | Cost-complexity pruning | Cost-complexity pruning
Build Trees Interactively                   | No                             | Yes                               | Yes                     | Yes
Supports Boosting                           | Yes                            | No                                | No                      | No

¹ SPSS has extended the logic of the CHAID approach to accommodate ordinal and continuous target variables.
² Continuous predictors are binned into ordinal variables containing by default approximately equal sized categories.

Note: C&R Tree and QUEST produce binary splits (two branch splits) when growing the tree, while C5.0 and CHAID can produce more than two subgroups when splitting occurs. However, if we had a predictor of type set with four categories, each of which were distinct with relation to the outcome field, C&R Tree and QUEST could perform successive binary splits on this field. This would produce a result equivalent to a multiple split at a single node, but requires additional tree levels. All methods can handle predictors and outcomes that are of type flag and set. CHAID and C&R Tree can use a continuous target or outcome field (of type range), while all but CHAID can use a continuous predictor (although see footnote 2). The trees that each method grows will not necessarily be identical because the methods use very different criteria for selecting a predictor. CHAID and QUEST use more standard statistical methods, while C5.0 and C&R Tree use non-statistical measures, as explained below.


Missing (blank) values are handled in three different ways. C&R Tree and QUEST use the substitute (surrogate) predictor field whose split is most strongly associated with that of the original predictor to direct a case with a missing value to one of the split groups during tree building. C5.0 splits a case in proportion to the distribution of the predictor field and passes a weighted portion of the case down each tree branch. CHAID uses all the missing values as an additional category in model building. Three of the four methods prune trees after growing them quite large, while CHAID instead stops before a tree gets too large. For all these reasons, you should not expect the four algorithms to produce identical trees for the same data. You should expect that important predictors would be included in trees built by any algorithm. Those interested in more detail concerning the algorithms should see the Clementine 11.1 Algorithms Guide. Also, you might consider C4.5: Programs for Machine Learning (Morgan Kaufmann, 1993) by Ross Quinlan, which details the predecessor to C5.0; Classification and Regression Trees (Wadsworth, 1984) by Breiman, Friedman, Olshen and Stone, who developed CART (Classification and Regression Tree) analysis; the article by Loh and Shih (1997, “Split Selection Methods for Classification Trees,” Statistica Sinica, 7: 815-840) that details the QUEST method; and for a description of CHAID, “The CHAID Approach to Segmentation Modeling: CHI-squared Automatic Interaction Detection,” Chapter 4 in Richard Bagozzi (ed.), Advanced Methods of Marketing Research (Blackwell, 1994).

3.3 Using the C5.0 Model
We will use the C5.0 node to create a rule induction model. The resulting generated model contains the rule induction model in either decision tree or rule set format. By default, the C5.0 node is labeled with the name of the output field. Like the generated Neural Net model, the C5.0 model can be browsed and predictions can be made by passing new data through it in the Stream Canvas. As with the Neural Net node, the C5.0 node must appear in a stream containing fully instantiated types (either in a Type node or the Types tab in a source node). Within the Type node or Types tab, the field to be predicted (or explained) must have direction OUT or it must be specified in the Fields tab of the C5.0 node. All fields to be used as predictors must have their direction set to IN (in the Types tab or Type node) or be specified in the Fields tab. Any field not to be used in the modeling must have its direction set to NONE. Any field with direction BOTH will be ignored by C5.0. Rather than rebuild the source and Type nodes, we use those created earlier and saved in a stream file. The C5.0 model node will use the same direction settings in the Type node as were used for modeling with neural networks, which are also appropriate for rule induction.

Click File…Open Stream, and then move to the c:\Train\ClemPredModel directory
Double-click NeuralNet.str (alternatively, open Backup_NeuralNet.str)
Delete all nodes except the Var. File node (named churn.txt), the Partition node and the Type node (right-click on a node, then click Delete, or click on a node and press the Delete key)

Place the C5.0 node from the Modeling palette to the upper right of the Type node in the Stream Canvas

Connect the Type node to the C5.0 node (as shown in Figure 3.1)

The name of the C5.0 node should immediately change to CHURNED.

Figure 3.1 C5.0 Modeling Node Added to Stream

Double-click the C5.0 node

Figure 3.2 C5.0 Node Model Tab

The Model name option allows you to set the name for both the C5.0 and resulting C5.0 rule nodes. The form (decision tree or rule set, both will be discussed) of the resulting model is selected using the Output type: option.


The Use partitioned data option is checked so that the C5.0 node will make use of the Partition field created by the Partition node earlier in the stream. Whenever this option is checked, only the cases the Partition node assigned to the Training sample will be used to build the model; the rest of the cases will be held out for Testing and/or Validation purposes. If unchecked, the field will be ignored and the model will be trained on all the data.

The Cross-validate option provides a way of validating the accuracy of C5.0 models when there are too few records in the data to permit a separate holdout sample. It does this by partitioning the data into N equal-sized subgroups and fitting N models. Each model uses (N–1) of the subgroups for training, then applies the resulting model to the remaining subgroup and records the accuracy. Accuracy figures are pooled over the N holdout subgroups, and this summary statistic estimates model accuracy applied to new data (a generic sketch of this pooling appears in code below). Since N models are fit, N-fold validation is more resource intensive; it reports the accuracy statistic, but does not present the N decision trees or rule sets. By default N, the number of folds, is set to 10.

For a predictor field that has been defined as type set, C5.0 will normally form one branch per value in the set. However, by checking the Group symbolic values check box, the algorithm can be set so that it finds sensible groupings of the values within the field, thus reducing the number of rules. This is often desirable. For example, instead of having one rule per region of the country, grouping symbolic values may produce rules such as:

Region [South, Midwest] …
Region [Northeast, West] …

Once trained, C5.0 builds one decision tree or rule set that can be used for predictions. However, it can also be instructed to build a number of alternative models for the same data by selecting the Boosting option. Under this option, when it makes a prediction it consults each of the alternative models before making a decision. This can often provide more accurate prediction, but takes longer to train. Also, the resulting model is a set of decision trees whose outcome is determined by voting, which is not simple to interpret.

The algorithm can be set to favor either Accuracy on the training data (the default) or Generality to other data. In our example, we favor a model that is expected to better generalize to other data and so we select Generality.
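Before carrying on with the node settings, here is the promised sketch of N-fold pooling. It is generic and illustrative only, not the C5.0 node's implementation; `train` and `accuracy` stand in for any model-building and scoring functions you supply.

```python
import numpy as np

def n_fold_accuracy(records, train, accuracy, n_folds=10, seed=1):
    """Generic N-fold cross-validation: train on N-1 folds, score on the held-out
    fold, then pool the N accuracy figures into a single estimate."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(records))
    folds = np.array_split(indices, n_folds)
    scores = []
    for i in range(n_folds):
        train_rows = [records[k] for j, fold in enumerate(folds) if j != i for k in fold]
        test_rows = [records[k] for k in folds[i]]
        model = train(train_rows)                  # placeholder model-building function
        scores.append(accuracy(model, test_rows))  # placeholder held-out scoring function
    return float(np.mean(scores))                  # pooled estimate for new data
```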

Click Generality option button
C5.0 will automatically handle errors (noise) within the data and, if known, you can inform Clementine of the expected proportion of noisy or erroneous data. This option is rarely used. As with all of the modeling nodes, after selecting the Expert option or tab, more advanced settings are available. In this course, we will discuss the Expert options briefly. The reader is referred to the Clementine 11.1 Node Reference for more information on these settings.

Click the Expert option button

Figure 3.3 C5.0 Node Model Tab Expert Options

By default, C5.0 will produce splits if at least two of the resulting branches have at least two data records each. For large data sets you may want to increase this value to reduce the likelihood of rules that apply to very few records. To do so, increase the value in the Minimum records per child branch box.

Click the Simple Mode option button, and then click Execute
A C5.0 Rule model, labeled with the predicted field (CHURNED), will appear in the Models palette of the Manager.

3.4 Browsing the Model
Once the C5.0 Rule node is in the Models palette, the model can be browsed.

Right-click the C5.0 node named CHURNED in the Models palette, then click Browse

Figure 3.4 Browsing the C5.0 Rule Node

The results are in the form of a decision tree and not all branches are visible. Only the beginning of the tree is shown. According to what we see of the tree so far, LOCAL is the first split in the tree. Further we see that if LOCAL <= 4.976 the Mode value for CHURNED is InVol and if LOCAL > 4.976 the Mode value is Current. The Mode lists the modal (most frequent) output value for the branch. The mode will be the predicted value, unless there are other fields that need to be taken into account within that branch to make a prediction. In this instance, no predictions of CHURNED are visible. To view the predictions we need to further unfold the tree. To unfold the branch LOCAL > 4.976, just click the expand button.

Click to unfold the branch LOCAL > 4.976

Figure 3.5 Unfolding a Branch

SEX is the next split field. Now we see that SEX is the best predictor for persons who spend more than 4.976 minutes on local calls. The Mode value for Males is Current and for Females is Vol. However, at this point we still cannot make any predictions for Sex because there is a symbol to the left of each value which means that other fields need to be taken into account before we can make a prediction. Once again we can unfold each separate branch to see the rest of the tree, but we will take a shortcut:

Click the All button in the Toolbar

Figure 3.6 Fully Unfolded Tree

We can see several nodes usually referred to as terminal nodes that cannot be refined any further. In these instances, the mode is the prediction. For example, if we are interested in the Current Customer group, one group we would predict to remain customers are persons where Local > 4.976, Sex = M and International <= 0.905. To get an idea about the number and percentage of records within such branches we ask for more details.

Click Show or hide instance and confidence figures in the toolbar

Figure 3.7 Instance and Confidence Figures Displayed (branch predicting Current)

The instances figure tells us that there are 256 persons who met those criteria. The confidence figure for this set of individuals is .956, which represents the proportion of records within this set correctly classified (predicted to be Current and actually being Current). If we were to score another dataset with this model, how would persons with the same characteristics be classified? Because Clementine assigns each record the modal category of its branch, everyone in the new data set who met the criteria defining this branch would be predicted to remain Current Customers. If you would like to present the results to others, an alternative format is available that helps visualize the decision tree. The Viewer tab provides this alternative format.

Click the Viewer tab

Click the Decrease Zoom tool (to view more of the tree). (You may also need to expand the size of the window.)

Figure 3.8 Decision Tree in the Viewer Tab

The root of the tree shows the overall percentages and counts for the three categories of CHURNED. Furthermore, the modal category is shaded. The first split is on Local, as we have seen already in the text display of the tree. Similar to the text display, we can decide to expand or collapse branches. In the right corner of some nodes a – or + is displayed, referring to an expanded or collapsed branch, respectively. For example, to collapse the tree at node 2:

Click in the lower right corner of node 2 (shown in Figure 3.9)

Figure 3.9 Collapsing a Branch

In the Viewer tab, toolbar buttons are available for zooming in or out; showing frequency information as graphs and/or as tables; changing the orientation of the tree; and displaying an overall map of the tree in a smaller window (tree map window) that aids navigation in the Viewer tab. When it is not possible to view the whole tree at once, such as now, one of the more useful buttons in the toolbar is the Tree map button because it shows you the size of the tree. A red rectangle indicates the portion of the tree that is being displayed. You can then navigate to any portion of the tree you want by clicking on any node you desire in the Tree map window.

Click in the lower right corner of node 2

Click on the Treemap button in the tool bar
Enlarge the Treemap until you see the node numbers (shown in Figure 3.10)

Figure 3.10 Decision Tree in the Viewer Tab with a Tree Map

3.5 Generating and Browsing a Rule Set
When building a C5.0 model, the C5.0 node can be instructed to generate the model as a rule set, as opposed to a decision tree. A rule set is a number of IF … THEN rules which are collected together by outcome. A rule set can also be produced from the Generate menu when browsing a C5.0 decision tree model.

In the C5.0 Rule browser window, click Generate…Rule Set

Figure 3.11 Generate Ruleset Dialog

Note that the default Rule set name appends the letters “RS” to the output field name. You may specify whether you want the C5.0 Ruleset node to appear in the Stream Canvas (Canvas), the generated Models palette (GM palette), or both. You may also change the name of the rule set and set lower limits on the support (the percentage of records having the particular values on the input fields) and confidence (accuracy) of the produced rules (the percentage of records having the particular value for the output field, given the values for the input fields).
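In terms of counts, the support and confidence of a single rule can be expressed as in the sketch below. The record dictionaries, field names, and lambda conditions are illustrative assumptions, not Clementine output.

```python
def rule_support_confidence(records, antecedent, consequent):
    """Support = % of records matching the rule's input conditions;
    confidence = % of those matching records that also have the predicted outcome."""
    matching = [r for r in records if antecedent(r)]
    support = 100.0 * len(matching) / len(records)
    confidence = (100.0 * sum(consequent(r) for r in matching) / len(matching)
                  if matching else 0.0)
    return support, confidence

# Hypothetical example using the churn data field names:
# antecedent = lambda r: r["LOCAL"] > 4.976 and r["SEX"] == "M"
# consequent = lambda r: r["CHURNED"] == "Current"
# rule_support_confidence(records, antecedent, consequent)
```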

Set Create node on: to GM Palette
Click OK

Figure 3.12 Generated C5.0 Rule Set Node

Generated Rule Set for CHURNED

Click File…Close to close the C5.0 Rule browser window
Right-click the C5.0 Rule Set node named CHURNEDRS in the generated Models palette in the Manager, then click Browse

Figure 3.13 Browsing the C5.0 Generated Rule Set

Apart from some details, this window contains the same menus as the browser window for the C5.0 Rule node.

Click the All button to unfold
Click the Show or hide instance and confidence figures button in the toolbar

The numbered rules now expand as shown below.

Figure 3.14 Fully Expanded C5.0 Generated Rule Set

For example, Rule #1 (Current) has this logic: if a person makes more than 4.976 minutes of local calls a month, is male, and makes less than or equal to .905 minutes of international calls, then we would predict Current. This form of the rules allows you to focus on a particular conclusion rather than having to view the entire tree. If the Rule Set is added to the stream, a Settings tab will become available that allows you to export the rule set in SQL format, which permits the rules to be directly applied to a database.
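Expressed outside Clementine, Rule #1 is just a simple predicate. The sketch below shows it as a function together with a roughly equivalent SQL condition; the field names are assumed to match those in churn.txt, and the SQL text is illustrative rather than the node's exact export format.

```python
def rule_1_current(record):
    """Rule #1 (Current): LOCAL > 4.976 and SEX = 'M' and INTERNATIONAL <= 0.905."""
    return (record["LOCAL"] > 4.976
            and record["SEX"] == "M"
            and record["INTERNATIONAL"] <= 0.905)

# A roughly equivalent SQL predicate (illustrative):
RULE_1_SQL = "LOCAL > 4.976 AND SEX = 'M' AND INTERNATIONAL <= 0.905"
```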

Click File…Close to close the Rule set browser window

3.6 Understanding the Rule and Determining Accuracy
Unlike the generated Neural Net node, the predictive accuracy of the rule induction model is not given directly within the C5.0 node. To get that information, you can use an Analysis node as we did in the last chapter to compare Neural Net models. However, at this stage we will use Matrix nodes and Evaluation Charts to determine how good the model is.

Creating a Data Table Containing Predicted Values
We use the Table node to examine the predictions from the C5.0 model.

Place the generated C5.0 Rule model named CHURNED from the Models palette in the Manager to the right of the Type node

Connect the Type node to the generated C5.0 Rule model named CHURNED
Place a Table node from the Output palette below the generated C5.0 Rule model named CHURNED
Connect the generated C5.0 Rule model named CHURNED to the Table node
Right-click the Table node, then click Execute and scroll to the right in the table

Figure 3.15 Two New Fields Generated by the C5.0 Rule Node

Two new columns appear in the data table, $C-CHURNED and $CC-CHURNED. The first represents the predicted value for each record and the second the confidence value for the prediction.

Click File…Close to close the Table output window

Comparing Predicted to Actual Values
As in the previous chapter, we will view a data matrix to see where the predictions were correct, and then we evaluate the model graphically with a gains chart.

Place two Select nodes from the Records palette, one to the lower right of the generated C5.0 node named CHURNED, and one to the lower left

Connect the generated C5.0 node named CHURNED to each Select node

First we will edit the Select node on the left that we will use to select the Training sample cases:

Double-click on the Select node on the left to edit it

Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button

Click the Select from existing field values button and insert the value 1_Training
Click OK, and then click OK again to close the dialog

Figure 3.16 Completed Selection for the Training Partition

Now we will edit the Select node on the right to select the Testing sample cases:

Double-click on the Select node on the right to edit it

Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button

Click the Select from existing field values button and insert the value 2_Testing
Click OK, and then click OK again to close the dialog

Now attach a separate Matrix node to each of the Select nodes. For each of the Select nodes:

Place a Matrix node from the Output palette below the Select node
Connect the Matrix node to the Select node
Double-click the Matrix node to edit it
Put CHURNED in the Rows:
Put $C-CHURNED in the Columns:


Click the Appearance tab
Click the Percentage of row option
Click on the Output tab and custom name the Matrix node for the Training sample as Training and the Testing sample as Testing (this will make it easier to keep track of which output we are looking at)

Click OK
For each actual churned category, the Percentage of row choice will display the percentage of records predicted into each of the outcome categories.

Execute each Matrix node

Figure 3.17 Matrix Output for the Training and Testing Samples

Looking at the Training sample results, the model correctly predicts about 79.8% of the Current category, 100% of the Involuntary Leavers, and 93.6% of the Voluntary Leavers. These results are far better than those found with a neural network for the Voluntary Leaver category (93.6% versus 82.0%), slightly better for Current Customers (79.8% versus 78.5%), and exactly the same for Involuntary Leavers (each model correctly predicted all of them). The results with the Testing sample compare favorably, which suggests that the model will perform well with new data.

Click File…Close to close the Matrix windows

To produce a gains chart for the Current group:

Place the Evaluation chart node from the Graphs palette to the right of the generated C5.0 Rule node named CHURNED

Connect the generated C5.0 Rule node named CHURNED to the Evaluation chart node
Double-click the Evaluation chart node, and click the Include best line checkbox
Click Execute

Figure 3.18 Gains Chart of the Current Customer Group

Click Edit…Enable Interaction

Figure 3.19 Gains Chart for the Current Customer Group (Interaction Enabled)

The gains line ($C-CHURNED) in the Training data chart rises steeply relative to the baseline, indicating that the hits for the Current outcome are concentrated in the percentiles predicted most likely to contain current customers according to the model. Just under 75% of the hits were contained within the first 40 percentiles. The gains line in the chart using the Testing data is very similar, which suggests that this model can be reliably used to predict current customers with new data.
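The gains line itself is straightforward to compute: order the records by the model's confidence that they are hits, then accumulate the percentage of actual hits captured as you move through the ordered records. A minimal sketch, assuming parallel lists of confidence scores and 0/1 hit flags:

```python
def cumulative_gains(scores, hits):
    """Return (percentile, cumulative % of hits captured) points for a gains chart."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_hits = sum(hits) or 1            # guard against no hits at all
    points, captured = [], 0
    for rank, i in enumerate(order, start=1):
        captured += hits[i]
        points.append((100.0 * rank / len(order), 100.0 * captured / total_hits))
    return points

# cumulative_gains(scores=[0.9, 0.2, 0.8, 0.4], hits=[1, 0, 1, 0])
# -> [(25.0, 50.0), (50.0, 100.0), (75.0, 100.0), (100.0, 100.0)]
```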

Click File…Close to close the Evaluation chart window

Changing Target Category for Evaluation Charts
By default, an Evaluation chart will use the first target outcome category to define a hit. To change the target category on which the chart is based, we must specify the condition for a User defined hit in the Options tab of the Evaluation node. To create a gains chart in which a hit is based on the Voluntary Leaver category:

Double-click the Evaluation node
Click the Options tab
Click the User defined hit checkbox
Click the Expression Builder button in the User defined hit group
Click @Functions on the functions category drop-down list
Select @TARGET on the functions list, and click the Insert button
Click the = button
Right-click CHURNED in the Fields list box, then select Field Values
Select Vol, and then click the Insert button

Figure 3.20 Specifying the Hit Condition within the Expression Builder

The condition (Vol as the target value) defining a hit was created using the Expression Builder.

Click OK

Figure 3.21 Defining the Hit Condition for CHURNED

In the evaluation chart, a hit will now be based on the Voluntary Leaver target category.

Click Execute
Click Edit…Enable Interaction in the resulting gains chart (not shown)

Figure 3.22 Gains Chart for the Voluntary Leaver Category (Interaction Enabled)

The gains chart for the Voluntary Leavers category is better (steeper in the early percentiles) than that for the Current category. For example, the top 40 model-ordered percentiles in the Training data chart contain over 87% of the Voluntary Leavers, as opposed to the same chart when we looked at Current Customers (that value was 75.3%).

Click File…Close to close the Evaluation chart window

To save this stream for later work:

Click File…Save Stream As
Move to the c:\Train\ClemPredModel directory
Type C5 in the File name: text box
Click Save


3.7 Understanding the Most Important Factors in Prediction
An advantage of rule induction models over neural networks is that the decision tree form makes it clear which fields have an impact on the predicted field. There is no great need to use alternative methods such as web plots and histograms to understand how the rule is working. Of course, you may still use the techniques described in the previous chapter to help understand the model, but they may not be needed. Unlike the neural network, there is no sensitivity analysis performed on the model. The most important fields in the predictions can be thought of as those that divide the tree in its earliest stages. Thus in this example the most important field in predicting churn is Local. Once the model has divided the data into two groups, those who do more local calling and those who do less, it focuses separately on each group to determine which predictors indicate whether a customer will remain loyal to the company, voluntarily leave, or even be dropped as a customer. The process continues until the nodes either cannot be refined any further or stopping rules halt tree growth.

3.8 Further Topics on C5.0 Modeling
Now that we have introduced you to the basics of C5.0 modeling, we will discuss the Expert options, which allow you to refine your model even further. This time, we will use an existing stream rather than building one from scratch.

Click File…Open Stream and move to the c:\Train\ClemPredModel folder
Double-click on DecisionTrees.str

The simple options within the C5.0 node allow you to use Boosting, specify the Expected noise (%), and choose whether the resulting tree favors Accuracy or Generality. Noisy (contradictory) data contain records in which the same, or very similar, predictor values lead to different outcome values. While C5.0 will handle noise automatically, if you have an estimate of it, the method can take this into account (see the section on Minimum Records and Pruning for more information on the effect of specifying a noise value). The expert mode allows you to fine-tune the rule induction process.

Double-click on the C5.0 node
Click the Model tab
Click the Expert Mode option button

Figure 3.23 Expert Options Available within the C5.0 Dialog (Model Tab)

When constructing a decision tree, the aim is to refine the data into subsets that are, or seem to be heading toward, single-class collections of records on the outcome field. That is, ideally the terminal nodes contain only one category of the output field. At each point of the tree, the algorithm could potentially partition the data based on any one of the input fields. To decide which is the “best” way to partition the data—to find a compact decision tree that is consistent with the data—the algorithms construct some form of test that usually works on the basis of maximizing a local measure of progress.

Gain Ratio Selection Criterion
Within C5.0, the Gain Ratio criterion, based on information theory, is used when deciding how to partition the data. In the following sections, we will describe, in general terms, how this criterion measures progress. However, the reader is referred to C4.5: Programs for Machine Learning by J. Ross Quinlan (Morgan Kaufmann, San Mateo, CA, 1993) for a more detailed explanation of the original algorithm. The criterion used in the predecessors to C5.0 selected the partition that maximizes the information gain. Information gained by partitioning the data based on the outcomes of field X (an input or predictor field) is measured by:

GAIN(X) = INFO(DATA) – INFO_X(DATA)

where INFO(DATA) represents the average information needed to identify the class (outcome category) of a record within the total data.

INFO_X(DATA) represents the expected information requirement once the data has been partitioned into each outcome of the current field being tested. The information theory that underpins the gain criterion can be summarized by the statement: “The information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm to the base 2 of that probability. So, if for example there are 8 equally probable messages, the information conveyed by any one of them is –log2(1/8), or 3 bits.” For details on how to calculate these values the reader is referred to Chapter 2 in C4.5: Programs for Machine Learning. Although the gain criterion gives good results, it has a flaw in that it favors partitions that have a large number of outcomes. Thus a symbolic predictor with many categories has an advantage over one with few categories. The gain ratio criterion, used in C5.0, rectifies this problem. The bias in the gain criterion is corrected by a kind of normalization in which the gain attributable to tests with many outcomes is adjusted downward. The gain ratio represents the proportion of information generated by dividing the data in the parent node into each of the outcomes of field X that is useful, i.e., that appears helpful for classification.

GAIN RATIO(X) = GAIN(X) / SPLIT INFO_X(DATA)

where SPLIT INFO_X(DATA) represents the potential information generated by partitioning the data into n outcomes, whereas the information gain measures only the information relevant to classification. The C5.0 algorithm chooses to partition the data based on the outcomes of the field that maximizes the information gain ratio. This maximization is subject to the constraint that the information gain must be large, or at least as great as the average gain over all tests examined. This constraint avoids the instability of the gain criterion when the split is near trivial and the split information is thus small. A small sketch of this calculation appears below.

Two other parameters the expert options allow you to control are the severity of pruning and the minimum number of records per child branch. In the following sections we will introduce each of these in turn and give advice on their settings.
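Before turning to those options, the following Python sketch illustrates how information gain and the gain ratio could be computed for a single symbolic predictor. It is only an illustration of the criterion, not the C5.0 implementation; the field names (PAY_MTHD, CHURNED) are borrowed from the churn data used in this chapter, and the values are invented.

```python
import numpy as np
import pandas as pd

def info(series):
    """Average information (entropy, in bits) needed to identify the class."""
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def gain_ratio(data, predictor, target):
    """Information gain from partitioning on `predictor`, normalized by its split information."""
    gain = info(data[target]) - sum(
        len(group) / len(data) * info(group[target])
        for _, group in data.groupby(predictor))
    p = data[predictor].value_counts(normalize=True)
    split_info = float(-(p * np.log2(p)).sum())   # potential information of the partition itself
    return gain / split_info if split_info > 0 else 0.0

# toy churn extract (invented values)
df = pd.DataFrame({"PAY_MTHD": ["CC", "CC", "CH", "CH", "Auto", "Auto"],
                   "CHURNED":  ["Current", "Vol", "Current", "Current", "Vol", "Vol"]})
print(round(gain_ratio(df, "PAY_MTHD", "CHURNED"), 3))
```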

Pruning and Attribute Winnowing Within C5.0

Within C5.0, once the tree has been built, it can be pruned back to create a more general (and less bushy) tree. Within the expert mode, the Pruning severity option allows you to control the extent of the pruning. The higher this number, the more severe the pruning and the more general the resulting tree. The algorithm used to decide whether a branch should be pruned back toward the parent node is based on comparing the predicted errors for the “sub-tree” (i.e., the unpruned branches) with those for the “leaf” (the pruned node). Error estimates for leaves and sub-trees are calculated as if they were to be applied to a set of unseen cases the same size as the training set. The formula used to calculate the predicted error rate for a leaf involves the number of cases within the leaf, the number of these cases that have been incorrectly classified within the leaf, and confidence limits based on the binomial distribution (a sketch of this idea appears below). The reader is referred to Chapter 4 in C4.5: Programs for Machine Learning for a more detailed description of error estimation and pruning in general.
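The exact error-estimation formula C5.0 uses is not exposed in the dialog, but the flavor of the calculation can be sketched with a generic upper confidence limit on the binomial error rate observed in a leaf. The confidence factor of 0.25 and the case counts below are assumptions for illustration; C4.5/C5.0 use their own approximation rather than the exact Clopper–Pearson bound shown here.

```python
from scipy.stats import beta

def predicted_errors(n_cases, n_errors, cf=0.25):
    """Pessimistic error count for a leaf: n_cases times the upper (1 - cf)
    confidence limit of the binomial error rate observed in the leaf."""
    upper_rate = beta.ppf(1.0 - cf, n_errors + 1, n_cases - n_errors)
    return n_cases * upper_rate

# a single leaf covering 20 cases with 2 misclassified...
leaf = predicted_errors(20, 2)

# ...versus the sub-tree it would replace: three leaves covering the same cases
subtree = sum(predicted_errors(n, e) for n, e in [(8, 0), (7, 1), (5, 0)])

print(round(leaf, 2), round(subtree, 2))
print("prune" if leaf <= subtree else "keep sub-tree")
```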

A second phase of pruning (global pruning) is then applied by default. It prunes further based on the performance of the tree as a whole, rather than at the sub-tree level considered in the first stage of pruning. This option (Use global pruning) can be turned off, which generally results in a larger tree.

After initially analyzing the data, the Winnow attributes option will discard some of the inputs to the model before building the decision tree. This can produce a model that uses fewer input fields yet maintains nearly the same accuracy, which can be an advantage in model deployment. This option can be especially effective when there are many inputs and when inputs are statistically related.

Minimum Records Per Child Branch

One other consideration when building a general decision tree is that the terminal nodes within the tree should not be too small. Within the C5.0 dialog, you control the Minimum records per child branch, which specifies that at any split point in the tree, at least two of the resulting branches must each contain at least this number of cases. The default is two cases, but increasing this number can be useful for noisy data sets and tends to produce less bushy trees.

How to Use Pruning and Minimum Records Per Branch

As previously mentioned, within the C5.0 dialog the Simple mode allows you to specify both the Expected noise (%) and whether the resulting tree favors Accuracy or Generality.

• If the algorithm is set to favor Accuracy, the Pruning Severity is set to 75 and the Minimum records per branch is 2; hence, although the tree is accurate, a degree of generality is retained by not allowing nodes to contain only one record.

• If the algorithm is set to favor Generality the Pruning Severity is set to 85 and the Minimum records per branch is 5.

• If the Expected noise (%) is used, the Minimum records per branch is set to half of this value.

Once a tree has been built using the simple options, the expert options may be used to refine the tree in these two common ways.

• If the resulting tree is large and has too many branches, increase the Pruning Severity.
• If there is an estimate for the expected proportion of noise (relatively rare in practice), set the Minimum records per branch to half of this value.

Boosting

C5.0 has a special method for improving its accuracy rate, called boosting. It works by building multiple models in a sequence. The first model is built in the usual way. Then a second model is built in such a way that it focuses especially on the records that were misclassified by the first model. Then a third model is built to focus on the second model’s errors, and so on. Finally, cases are classified by applying the whole set of models to them, using a weighted voting procedure to combine the separate predictions into one overall prediction. Boosting can significantly improve the accuracy of a C5.0 model, but it also requires longer training. The Number of trials option allows you to control how many models are used for the boosted model.

While boosting might appear to offer something for nothing, there is a price. When model building is complete, more than one tree is used to make predictions. Therefore, there is no simple description of the resulting model, nor of how a single predictor affects the outcome field. This can be a serious deficiency, so boosting is normally used when the chief goal of an analysis is predictive accuracy, not understanding.
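C5.0’s boosting procedure itself is proprietary, but the reweight-and-vote idea can be sketched with a generic AdaBoost-style loop over scikit-learn decision trees. The synthetic data, the tree depth, and the weighting formula below are illustrative assumptions, not the settings C5.0 actually uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic two-class data standing in for a churn-style problem
X, y = make_classification(n_samples=500, random_state=1)
y = np.where(y == 1, 1, -1)               # code the two classes as +1 / -1

n_trials = 10                             # plays the role of Number of trials
w = np.full(len(y), 1.0 / len(y))         # equal record weights to start
trees, alphas = [], []

for _ in range(n_trials):
    tree = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = tree.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)        # weighted error of this model
    if err <= 0 or err >= 0.5:
        break
    alpha = 0.5 * np.log((1 - err) / err)         # voting weight for this model
    w *= np.exp(-alpha * y * pred)                # up-weight the records it misclassified
    w /= w.sum()
    trees.append(tree)
    alphas.append(alpha)

# weighted vote across the whole sequence of models
score = sum(a * t.predict(X) for a, t in zip(alphas, trees))
ensemble = np.where(score >= 0, 1, -1)
print("training accuracy of the boosted sequence:", np.mean(ensemble == y))
```

More trials usually improve accuracy, but the result is a committee of trees with no single, simple description, which is exactly the trade-off noted above.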

Misclassification Costs

The Costs tab allows you to set misclassification costs. When using a tree to predict a symbolic output, you may wish to assign costs to misclassifications (cases where the tree predicts incorrectly) to bias the model away from “expensive” mistakes. The Misclassifying controls allow you to specify the cost attached to each possible misclassification. The default costs are set at 1.0, representing that each misclassification is equally costly. When unequal misclassification costs are specified, the resulting trees tend to make fewer expensive misclassifications, usually at the price of an increased number of the relatively inexpensive misclassifications.
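To see how unequal costs bias predictions away from expensive mistakes, the sketch below computes the expected cost of each possible prediction at a single terminal node. The category probabilities and cost values are invented; Clementine applies costs during tree construction as well, not only at prediction time.

```python
import numpy as np

categories = ["Current", "Vol", "Invol"]          # CHURNED categories from churn.txt
# cost[i][j] = cost of predicting categories[j] when the true category is categories[i];
# these values are purely illustrative, not defaults from the course data
cost = np.array([[0.0, 1.0, 1.0],
                 [5.0, 0.0, 1.0],                 # missing a voluntary leaver is expensive
                 [1.0, 1.0, 0.0]])

# class probabilities at a terminal node (e.g., from the node's distribution)
p_node = np.array([0.70, 0.25, 0.05])

expected_cost = p_node @ cost                     # expected cost of each possible prediction
print(dict(zip(categories, np.round(expected_cost, 2))))
print("prediction:", categories[int(np.argmin(expected_cost))])
```

With equal costs this node would predict Current, its most probable category; the large cost attached to missing a voluntary leaver shifts the prediction to Vol.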

3.9 Modeling Symbolic Outputs with Other Decision Tree Algorithms

As we saw in Table 3.1, C5.0 can only be used to model symbolic outcomes. QUEST has the same limitation. The other two algorithms, CHAID and C&R Tree, can be used to model both symbolic and continuous outcomes. Before we discuss how to create models with continuous targets, let’s take a look at the various options for modeling symbolic outcomes in CHAID, C&R Tree, and QUEST. In the interest of time, we will not go through an entire example as we did with C5.0, but instead will leave that for you to do in the exercises.

3.10 Modeling Symbolic Outputs with CHAID

First, we’ll look at the CHAID node and the options available there.

Click Cancel to close the C5.0 dialog
Double-click the CHAID node named CHURNED

There are two available methods, CHAID and Exhaustive CHAID. The latter is a modification of CHAID designed to address some of its weaknesses. Exhaustive CHAID examines more possible splits for a predictor, thus improving the chances of finding the best predictor (at the cost of additional processing time).

Figure 3.24 CHAID Node Dialog (Model Tab)

CHAID allows only one other change in the Model tab, the maximum tree depth (the number of levels the tree can grow). Since CHAID doesn’t prune a bushy tree, the user can specify the depth with the Levels below root setting, which is 5 initially. This setting should depend on the size of the data file, the number of predictors, and the complexity of the desired tree. You can set one of two modes: Generate model builds the model, while Launch interactive session launches the Interactive Tree feature, which we will discuss in a later section.

Click the Expert tab, and then click the Expert Mode option button

The Expert mode options are shown in Figure 3.25. To select the predictor for a split, CHAID uses a chi-square test in the table defined at each node by a predictor and the outcome field. CHAID chooses the predictor that is the most significant (smallest p value). If that predictor has more than 2 categories, CHAID compares them and collapses together those categories that show no differences in the outcome. This category merging process stops when all remaining categories differ at the specified testing level (Alpha for Merging). It is possible for CHAID to split merged categories, controlled by the Allow splitting of merged categories check box. (Note that a categorical predictor with more than 127 discrete categories will be ignored by CHAID.) For continuous predictors, the values are binned into a maximum of 10 groups, and then the same tabular procedure is followed as for flag and set types.

Figure 3.25 CHAID Expert Options

Because many chi-square tests are performed, CHAID automatically adjusts its significance values when testing the predictors. These are called Bonferroni adjustments and are based on the number of tests. You should normally leave this option turned on; in small samples or with only a few predictors, you could turn it off to increase the power of your analysis.
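The predictor-selection step can be illustrated with scipy: cross-tabulate each candidate predictor against the outcome at the node, run a chi-square test, and keep the predictor with the smallest adjusted p value. The data are invented, and the simple Bonferroni factor used here (the number of predictors tested) is only a stand-in for CHAID’s actual adjustment, which also accounts for the category merging performed within each predictor.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# hypothetical churn extract; field names follow churn.txt
df = pd.DataFrame({
    "SEX":      ["F", "M", "F", "M", "F", "M", "F", "F", "M", "M"],
    "PAY_MTHD": ["CC", "CC", "CH", "CH", "CC", "Auto", "Auto", "CC", "CH", "Auto"],
    "CHURNED":  ["Vol", "Current", "Vol", "Current", "Vol",
                 "Current", "Vol", "Current", "Current", "Current"],
})

predictors = ["SEX", "PAY_MTHD"]
adjusted_p = {}
for field in predictors:
    table = pd.crosstab(df[field], df["CHURNED"])   # predictor-by-outcome table at this node
    chi2, p, dof, _ = chi2_contingency(table)
    adjusted_p[field] = min(p * len(predictors), 1.0)   # crude Bonferroni-style adjustment

print(adjusted_p)
print("split on:", min(adjusted_p, key=adjusted_p.get))
```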

Click the Stopping… button

The Stopping options control stopping rules for growth based on node size. These can be specified either as an absolute number of records or as a percentage of the total number of records. By default, a parent branch to be split must contain at least 2% of the records, and a child branch must contain at least 1%. It is often more convenient to work with the absolute number of records rather than a percentage, but in either case you will very likely modify these values to get a smaller, or larger, tree.

Figure 3.26 CHAID Stopping Criteria

Unlike other models, CHAID uses missing, or blank, values when growing a tree. All blank values are placed in a missing category that is treated like any other category for nominal predictors. For ordinal and continuous predictors, the process of handling blanks is a bit different, but the effect is the same (see the Clementine 11.1 Algorithms Guide for detailed information). If you don’t want to include blank data in a model, it should be removed beforehand.

3.11 Modeling Symbolic Outputs with C&R Tree

We move next to the C&R Tree node to predict a symbolic output field.

Click Cancel, and then Cancel again to close the CHAID dialogs
Double-click on the C&R Tree node named CHURNED
Click the Model tab

The only Simple mode model option available concerns the Maximum tree depth, as with CHAID. By default this value is 7, so C&R Tree will grow a deeper tree than CHAID, all things being equal (which they really aren’t, given the different methods of predictor selection). It is also possible, as with CHAID, to grow a tree interactively. Since pruning is performed with this method, and other stopping rules may be triggered, the actual tree depth may be less than the maximum specified.

Figure 3.27 Classification and Regression Trees (C&R Tree) Dialog

Click the Expert tab
Click the Expert Mode option button

Figure 3.28 Expert Options for Classification and Regression Trees

Pruning within C&RT

The Prune tree check box will invoke pruning. The standard error rule allows C&R Tree to select the simplest tree whose risk estimate (the proportion of errors the tree model makes when equal misclassification costs and empirical priors are used) is close to that of the subtree with the smallest risk. The Multiplier indicates how many standard errors of difference are allowed in the risk estimate between the final tree and the tree with the smallest risk. As the multiplier is increased, the pruning becomes more severe.

The Stopping options control stopping rules for growth based on node size, with the same settings as for CHAID, so we won’t review them here. Note that, unlike CHAID, pruning is an important component of C&R Tree, so although the default stopping values may seem small, pruning can trim back some of the small branches.

Impurity Criterion

The criterion that guides tree growth in C&R Tree with a symbolic output field is called impurity. It captures the degree to which responses within a node are concentrated in a single output category. A pure node is one in which all cases fall into a single output category, while a node with the maximum impurity value would have the same number of cases in each output category. Impurity can be defined in a number of ways, and two alternatives are available within the C&R Tree procedure. The default, and more popular, measure is the Gini measure of dispersion. If P(t)i is the proportion of cases in node t that are in output category i, then the Gini measure is:

GINI(t) = 1 – Σi [P(t)i]²

Alternatively:

GINI(t) = Σi≠j P(t)i P(t)j

If two nodes have different distributions across three response categories (for example (1, 0, 0) and (1/3, 1/3, 1/3)), the one with the greater concentration of responses in a single category (the first one) will have the lower impurity value: for (1, 0, 0) the impurity is 1 – (1² + 0² + 0²), or 0; for (1/3, 1/3, 1/3) the impurity is 1 – ((1/3)² + (1/3)² + (1/3)²), or .667. The Gini measure ranges between 0 and 1, although the maximum value is a function of the number of output categories. Thus far we have defined impurity for a single node. It can be defined for a tree as the weighted average of the impurity values from the terminal nodes. When a node is split into two child nodes, the impurity for that branch is simply the weighted average of their impurities. Thus if two child nodes resulting from a split have the same number of cases and their individual impurities are .4 and .6, their combined impurity is .5*.4 + .5*.6, or .5. When growing the tree, C&R Tree splits a node on the predictor that produces the greatest reduction in impurity (comparing the impurity of the parent node to the combined impurity of the child nodes). This change in impurity from a parent node to its child nodes is called the improvement, and under Expert options you can specify the minimum change in impurity required for tree growth to continue. The default value is .0001, and if you are considering modifying this value, you might calculate the impurity at the root node (from the overall output proportions) to establish a point of reference.

The problem with using impurity as a criterion for tree growth is that you can almost always reduce impurity by enlarging the tree, and any tree will have 0 impurity if it is grown large enough (if every node has a single case, impurity is 0). To address these difficulties, the developers of the classification and regression tree methodology (see Breiman, Friedman, Olshen, and Stone, Classification and Regression Trees, Wadsworth, 1984) developed a pruning method based on a cross-validated cost complexity measure (as discussed above).

By default, the Gini measure of dispersion is used. Breiman and colleagues proposed Twoing as an alternative impurity measure. If the target has more than two output categories, Twoing creates binary splits of the response categories in order to calculate impurity. Each possible combination of output categories split into two groups is separately evaluated for impurity with each predictor variable, and the best split across predictors and target category combinations is chosen. Ordered Twoing (inactive here because the output field is of type set, not ordered set) applies Twoing as described above, except that the output category combinations are limited to those consistent with the rank order of the categories. For example, if there are five output categories numbered 1, 2, 3, 4, and 5, Ordered Twoing would examine the (1,2) (3,4,5) split, but the (1,4) (2,3,5) split would not be considered, since only contiguous categories can be grouped together. Of these methods, the Gini measure is most commonly used.
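A short Python sketch of the Gini calculation and of the improvement for a candidate split; the category counts are invented for illustration.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, given the count of cases in each output category."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

parent = [50, 50]                  # e.g., Current vs. Vol at the parent node
left, right = [40, 10], [10, 40]   # a candidate split into two child nodes

n_left, n_right = sum(left), sum(right)
n = n_left + n_right
children = (n_left / n) * gini(left) + (n_right / n) * gini(right)  # weighted average impurity
improvement = gini(parent) - children
print(gini(parent), children, improvement)
```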

Surrogates

Surrogates are used to deal with missing values on the predictors. For each split in the tree, C&R Tree identifies the input fields (the surrogates) that are most similar statistically to the selected split field. When a record to be classified has a missing value for a split field, its value on a surrogate field can be used to make the split.

The Maximum surrogates option controls how many surrogate predictor fields will be stored at each node. Retaining more surrogates slows processing, and the default (5) is usually adequate.

Priors in C&RT

Historically, priors have been used to incorporate knowledge about the base population rates (here, of the output field categories) into the analysis. Breiman et al. (1984) point out that if one target category has twice the prior probability of another, this effectively doubles the cost of misclassifying a case from the first category, since it is counted twice. Thus by specifying a larger prior probability for a response category, you can effectively increase the cost of its misclassification. Since priors are only given at the level of the base rate for the output field categories (with J categories there are J prior probabilities), their use implies that the misclassification of a record actually in output category j has the same cost regardless of the category into which it is misclassified (that is, C(k|j) is the same for all k not equal to j).

Click the Priors button

By default, the prior probabilities are set to match the probabilities found in the training data. The Equal for all classes option allows you to set all priors equal (which might be used if you know your sample does not represent the population and you don’t know the population distribution of the outcome), and you can enter prior probabilities yourself (Custom option). The prior probabilities should sum to 1; if you enter custom priors that reflect the desired proportions but do not sum to 1, the Normalize button will adjust them. Finally, priors can be adjusted based on misclassification costs (see Breiman’s comment above) entered in the Costs tab.

Figure 3.29 C&RT Expert Options: Priors

Misclassification Costs in C&RT

Incorporating misclassification costs in the analysis is not an expert option, although modifying priors based on misclassification costs is.

Click Cancel
Click the Costs tab

As we briefly discussed in the context of C5.0, unequal misclassification costs can be specified for the outcome categories and will be taken into account during tree creation. By default, all misclassification costs are set equal (to 1).

Figure 3.30 Misclassification Costs

3.12 Modeling Symbolic Outputs with QUEST

We move next to the QUEST node for predicting a symbolic field.

Click Cancel to close the C&R Tree dialog
Double-click on the QUEST node named CHURNED
Click the Model tab

QUEST (Quick, Unbiased, Efficient Statistical Tree) is a binary classification method that was developed, in part, to reduce the processing time required for large C&R Tree analyses with many fields and/or records. It also tries to reduce the tendency in decision tree methods to favor predictors that allow more splits (see Loh and Shih, 1997).

Figure 3.31 QUEST Model Tab Options

Like CHAID and C&R Tree, QUEST allows only the specification of maximum tree depth in Simple mode. A model can be built or an Interactive Tree session launched.

Click the Expert tab
Click the Expert mode option button

QUEST separates the tasks of predictor selection and splitting at a node. Like CHAID, it uses statistical tests to pick a predictor at a node. For each continuous or ordinal predictor variable, QUEST performs an analysis of variance, and then uses the significance of the F test as a criterion. For nominal predictors (of type flag and set), chi-square tests are performed. The predictor with the smallest significance value from either the F or chi-square test is selected. Although not evident from the dialog box options, Bonferroni adjustments are made, as with CHAID (not under user control). QUEST is more efficient than C&R Tree because not all splits are examined, and category combinations are not tested when evaluating a predictor for selection.
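The two-track selection step can be sketched with scipy: an ANOVA F test for a continuous predictor and a chi-square test for a nominal one, keeping whichever yields the smaller p value. The synthetic data below and the omission of the Bonferroni adjustment are simplifications for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, chi2_contingency

# hypothetical extract: one continuous and one nominal candidate predictor
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LONGDIST": np.r_[rng.normal(20, 5, 50), rng.normal(30, 5, 50)],  # continuous
    "SEX":      rng.choice(["F", "M"], 100),                          # nominal
    "CHURNED":  ["Current"] * 50 + ["Vol"] * 50,
})

p_values = {}

# continuous or ordinal predictor: one-way ANOVA of the predictor across target groups
groups = [g["LONGDIST"].values for _, g in df.groupby("CHURNED")]
p_values["LONGDIST"] = f_oneway(*groups).pvalue

# nominal predictor: chi-square test of the predictor-by-target table
_, p, _, _ = chi2_contingency(pd.crosstab(df["SEX"], df["CHURNED"]))
p_values["SEX"] = p

print(p_values)
print("selected predictor:", min(p_values, key=p_values.get))
```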

Figure 3.32 QUEST Expert Options

After selecting a predictor, QUEST determines how the field should be split (into two groups) by doing a quadratic discriminant analysis, using the selected predictor on groups formed by the target categories. The details are rather complex and can be found in Loh and Shih (1997). The type (nominal, continuous) of the predictor determines how it is treated in this method. While quadratic discriminant analysis allows for unequal variances in the groups and so makes one fewer assumption than linear discriminant analysis, it does assume that the distribution of the data is multivariate normal, which is unlikely for predictors that are flags and sets. QUEST uses an alpha (significance) value of .05 for splitting in the discriminant analysis, and you can modify this setting. For large files you may wish to reduce alpha to .01, for example.

Pruning, Stopping, Surrogates

QUEST follows the same pruning rule as C&R Tree, using a cost-complexity measure that takes into account the increase in error if a branch is pruned, along with a standard error rule. The Stopping choices are the same as for CHAID and C&R Tree. QUEST also uses surrogates to allow predictions for missing values, employing the same methodology as C&R Tree.

3.13 Interactive Trees

Decision trees can be generated automatically, allowing the algorithm to find the best complete tree. As an alternative, you can use the Interactive Tree Builder to take control of the tree-building process. You can grow the tree level by level, you can select specific predictors at a node, and you can ensure that the tree is not so complex that it is impractical to use for a business problem. To use the Tree Builder, we simply specify a tree model as usual, with the one addition of selecting Launch interactive session in the Build area on the Model tab. All decision trees except C5.0 support interactive trees. One caution is that all ordered sets used in the model must have numeric storage to be used in the Tree Builder. We’ll examine tree building by using CHAID to predict the CHURNED field in the current stream.

Click Cancel to close the QUEST node
Double-click the CHAID node to edit it
Click the Model tab
Click the Launch interactive session option button

When Launch interactive session is selected, the Use tree directives check box becomes active. Tree directives are used to save the current tree so that it can be regrown. The next time the model-building node is executed, the current tree, including any custom splits you define, will automatically be regrown if you specify the saved directives file under the Directives… button.

Figure 3.33 CHAID Model Tab with Interactive Tree Method

Interactive Tree Builder

When the model executes, a generated CHAID model is not added to the Model Manager area. Instead, the Tree Builder opens automatically, as shown in Figure 3.34.

Click Execute to run the model

The tree opens in the Viewer tab (see Figure 3.34), displaying the root node with the distribution of the outcome field (here CHURNED). You can grow the tree level by level, all at once, or anything in between. You can also request information on the tree.

Figure 3.34 Interactive Tree Builder to Predict CHURNED Field with CHAID

Right-clicking on a node brings up a context menu with several options.

Right-click on the root node

Figure 3.35 Context Menu With Tree-Growing Choices

Because there has been no tree growth yet, some of the options are inactive. From here, we can grow the full tree, grow the tree only one level, or grow just the selected branch (the latter two are equivalent here). We can also specify a custom split if there is interest in a particular predictor.

Let’s grow the tree one level.

Click on Grow Tree One Level

CHAID selects the field LONGDIST (long distance minutes) as the best predictor. This field is of type range, so it was binned before being used in the model. CHAID finds that the statistically best split is into three groups, as shown in Figure 3.36. Notice that all the involuntary leavers are in the first node, with a value of 0 on LONGDIST.

Figure 3.36 Interactive Tree Grown One Level with LONGDIST as Best Predictor

Although we are hardly finished with building the tree, we can learn how accurate the current tree is at any time.

Click the Risks tab

The Risks tab displays the error of the current tree in predicting CHURNED. The Risk estimate for the Training data (0.370) is the amount of error, so the model is (1 – .370) * 100 = 63.0% accurate in its predictions. This isn’t too bad for a model with only one predictor. The error rate for the Testing data (0.381) is nearly the same, which is a good sign that the model will work well with unseen data. One concern is that the model incorrectly predicts that most Voluntary Leavers (217 out of 267) will remain Current Customers. This suggests that the model will have to be refined later on, perhaps by balancing the three groups, so that the model will do a better job of finding the people who are likely to leave the company.

Figure 3.37 Risk Estimates for Current CHAID Model

Click the Viewer tab
Right-click on Node 2 and select Grow Branch with Custom Split from the context menu

We can control how the tree will split at this node. In Figure 3.38, we can see that if we grow this node automatically, SEX will be used to grow the branch.

Figure 3.38 Define Split Dialog for Node 2

We can see what other fields are statistically significant.

Click the Predictors… button

Figure 3.39 Potential Fields For Splitting Node 2

Four fields meet the standard statistical criterion of .05 probability (adjusted by the Bonferroni correction). Since International is also highly significant, let’s select that field instead.

Click International
Click OK
Click Grow

Figure 3.40 shows the resulting split. The split is at 0 minutes of international calls, into 2 child nodes. Customers who don’t make international calls are more likely to be current customers. Those who do make international calls are more likely to be voluntary churners, so we will finally make a prediction for this group. Note, of course, that this branch only applies to customers who make between 0.00 and 26.055 minutes of long distance calls. The red symbol under Node 2 indicates that the tree was not grown automatically at that node but that the user instead defined the split.

Figure 3.40 Tree Grown One Level From Node 2 With User-Selected Predictor

Instead of examining these results any further, let’s remove the International branch and allow the model to use the best predictor, which was Sex, as we saw in Figure 3.38.

Right-click on Node 2 and select Remove One Level from the context menu
Right-click on Node 2 and select Grow Tree One Level from the context menu

Figure 3.41 Tree Grown One Level From Node 2 With the Most Significant Predictor

There are other options available. The tree display can be modified by displaying graphs instead of tabular statistics, or both, or by hiding certain branches. You can zoom in or out, or display a Tree Map window to help navigate large trees. On the Gains tab, various statistics and charts are available that can help you readily assess the effectiveness of the current tree (the graphs are those produced by the Evaluation node). One other very useful characteristic of the Gains table is that it helps you identify which segments contain a particular class of customers, for example Voluntary Leavers, even though the Misclassification Table may have predicted many of them into the wrong category.

Click on the Gains tab
Select Vol from the menu to the right of Target Category

Figure 3.42 Gains Table Results

Each row in the Gains table represents a terminal node in the tree. The Node: n column is the total number of customers in each node; the Node (%) column is the node size as a percent of the entire sample; the Gain column is a count of Voluntary Leavers within each node; the Gain % column is the percent of all the Voluntary Leavers in the entire sample that fall within each node; the Response % column is the percent of people in the node who are Voluntary Leavers; and the Index % column indicates the relative probability of finding a Voluntary Leaver within the node versus randomly choosing cases from the entire sample. The Index % value for Node 6 is derived by dividing 63.86 (Response %) by 37.135 (the percent of Voluntary Leavers in the entire sample) and multiplying by 100. If the index percentage is greater than 100%, you have a better than average chance of finding cases with the desired target category in that node. Within the Gains table we can now see that 68.16% of all Voluntary Leavers come from Node 6. From looking at the tree, these are the people who are both female and make between 0 and 26.055 minutes of long distance calling per month. A small sketch of these calculations appears below.

Interactive trees are not a model, but instead are a form of output, like a table or graph. When you are satisfied with the tree you have built, you can generate a model to be used in the stream to make predictions.
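The Gains table columns can be reproduced with a few lines of pandas, given each record’s terminal node and actual outcome. The node assignments below are invented; with real scored data the Index % values would match those reported by the Tree Builder (for Node 6 in Figure 3.42, the arithmetic is 63.86 / 37.135 * 100).

```python
import pandas as pd

# per-record results: the terminal node each record fell into, plus the actual CHURNED value
scored = pd.DataFrame({
    "node":    [6, 6, 6, 4, 4, 5, 5, 5, 6, 4],
    "CHURNED": ["Vol", "Vol", "Current", "Current", "Vol",
                "Current", "Current", "Vol", "Vol", "Current"],
})

target = "Vol"
overall_rate = (scored["CHURNED"] == target).mean() * 100   # % of Voluntary Leavers overall

gains = (scored.assign(hit=scored["CHURNED"] == target)
               .groupby("node")
               .agg(node_n=("hit", "size"), gain=("hit", "sum")))
gains["node_pct"] = gains["node_n"] / len(scored) * 100      # Node (%)
gains["gain_pct"] = gains["gain"] / gains["gain"].sum() * 100  # Gain %
gains["response_pct"] = gains["gain"] / gains["node_n"] * 100  # Response %
gains["index_pct"] = gains["response_pct"] / overall_rate * 100  # Index %
print(gains.sort_values("index_pct", ascending=False))
```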

Click Generate…Generate Model
Click OK in the resulting dialog box (not shown)
Close the Tree Builder window

A generated CHAID model appears in the upper left corner of the Stream Canvas. It can be edited, attached to other nodes, and used like any other generated model. The only difference is in how it was created.

3.14 Predicting Numeric Fields

Two of the decision tree models, CHAID and C&R Tree, can predict a numeric field (of type range). We will briefly review the options available for this type of outcome.

Numeric Outputs with C&R Tree

When a numeric field is used as the output field in C&R Tree, the algorithm runs in the way described earlier in this chapter. For a numeric output field (the regression trees portion of the algorithm), the impurity criterion is still used but is based on a measure appropriate for a continuous variable: within-node variance. It captures the degree to which records within a node are concentrated around a single value. A pure node is one in which all cases have the same output value, while a node with a large impurity value (in principle, the theoretical maximum would be infinity) would contain cases with very diverse values on the output field. For a single node, the variance (the standard deviation squared) of the output field is calculated from the records within the node. When generating a prediction, the algorithm uses the average value of the outcome field within the terminal node.
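A minimal sketch of the regression-tree impurity calculation, using invented claim amounts: the improvement of a candidate split is the parent node’s variance minus the weighted average variance of the two child nodes, and each child’s mean is the value that would be predicted for records falling in it.

```python
import numpy as np

def within_node_variance(y):
    """Impurity of a node for a numeric target: the variance of the target in the node."""
    return np.asarray(y, dtype=float).var()

# hypothetical claim amounts, split on LOS <= 2.5 versus LOS > 2.5
parent = np.array([3200, 4100, 3900, 5200, 6100, 8000, 4300, 7500])
left   = np.array([3200, 4100, 3900, 4300])     # shorter stays
right  = np.array([5200, 6100, 8000, 7500])     # longer stays

n = len(parent)
children = (len(left) / n) * within_node_variance(left) + \
           (len(right) / n) * within_node_variance(right)
improvement = within_node_variance(parent) - children
print(improvement, left.mean(), right.mean())   # the node means are the predictions
```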

Click File…Open Stream and move to c:\Train\ClemPredModel (if necessary)
Double-click on CRTree.str

Figure 3.43 Clementine Stream with C&R Tree Model Node (Numeric Output Field)

The data file consists of patient admissions to a hospital. The goal is to build a model predicting insurance claim amount based on hospital length of stay, severity of illness group and patient age.

Double-click on the C&R Tree node (labeled CLAIM)

The Model tab for the C&R Tree dialog with a numeric output is the same as for a symbolic output. However, if we explore the Expert options we find that some expert settings—priors and misclassification costs—which are not relevant for a continuous output field, are inactive when Expert mode is chosen. Otherwise, setting up the model-building parameters, and executing the tree, is identical to the process for a categorical outcome. The generated model will display the predicted mean of the insurance claim amount in each terminal node.

Figure 3.44 C&R Tree Dialog for a Numeric Output Field

Click the Launch interactive session button
Click the Expert tab
Click the Expert Mode option button
Click Prune Tree

Figure 3.45 C&R Tree Dialog Expert Options

By selecting Prune Tree, pruning will be performed to produce a more compact tree.

Click Execute

Figure 3.46 Interactive Tree Builder to Predict Claim with C&R Tree

Because the target field is numeric, different statistics appear in the nodes. Now the nodes display the mean, number of cases, and percentage of the sample. Thus, the mean insurance claim for persons in this data file is slightly over $4631, which we would predict for each person if we didn’t know how long they stayed in the hospital, their age, or how severely ill they were. Once the tree is grown, we should get some insight into the characteristics of patients that separate high insurance claims from low ones.

Right-click on Node 0 and select Grow Tree and Prune

Figure 3.47 Interactive Tree Builder Fully Grown and Pruned Tree

The results indicate that Length of Stay is the best predictor. The average insurance claim for persons who stay more than 2.5 days in the hospital is $5742.50, while the mean claim for persons who spent less time in the hospital is only $4276.62. The predictions are further refined as we work our way down the tree. For example, the average claim for individuals who were in the hospital more than 3.5 days (Node 6) was $8032.73, and for persons who spent less than or equal to 2.5 days in the hospital and were older than 28.5, the average prediction was $4108.60.

While this tree is perfectly valid, some of the results don’t make intuitive sense. For example, it may not be possible to be charged for 2.5 days; most hospitals charge you for an entire day even if you stayed only a fraction of it. Similarly, the Severity of Illness field (ASG) has only 3 values, and we are treating it as if it were numeric. An examination of the Type node will show how each field is currently typed.

Double-click on the Type Node in the Stream Canvas

Figure 3.48 Type Node Original Field Types

As we can see, all the fields have been typed as numeric. In order to have the model treat each day in the hospital as a whole day, and to ensure that the Severity of Illness groups remain discrete, we need to reinstantiate the LOS and ASG fields so they are typed as SET.

Hold down the Control key and click on the ASG and LOS fields to select them
Right-click the Type box for either of these two variables
Click Set Type…Discrete on the context menu
Click the Read Values button
Click OK

Figure 3.49 Type Node after ASG and LOS have been Reinstantiated to SET

Close the previous Interactive Tree window
Execute the C&RT node
Right-click on Node 0 and select Grow Tree and Prune from the context menu

Figure 3.50 Tree after Reinstantiating ASG and LOS to SET

While Length of Stay is still the best predictor, number of days is now displayed as a whole number. The average claim for persons who stayed 1 or 2 days is $4276.62 and for those who stayed 3, 4 or 6 days, the average claim is $5742.50. Also, the values for ASG are now treated as discrete.

Numeric Outputs with CHAID

When CHAID is used with a continuous target, the overall approach is identical to what we have discussed above, but the specific tests used to select predictors and merge categories differ. An analysis of variance test is used for predictor selection and merging of categories, with the target as the dependent variable. Nominal and ordinal predictors are used in their untransformed form. Continuous predictors are binned, as described above for CHAID, into at most 10 categories; then the analysis of variance test is performed on the transformed variable. The field with the lowest p value for the ANOVA F test is selected as the best predictor at a node, and the splitting and merging of categories proceeds based on additional F tests.

We don’t need to save the stream files for this chapter.

Summary Exercises

A Note Concerning Data Files

In this training guide, files are assumed to be located in the c:\Train\ClemPredModel directory.

Exercise Data Information Sheets

The exercises in this chapter are written around the data file churn.txt. The following section gives details of the file.

churn.txt contains information from a telecommunications company. The data are comprised of customers who at some point have purchased a mobile phone. The primary interest of the company is to understand which customers will remain with the organization or leave for another company. The file contains the following fields:

ID                    Customer reference number
LONGDIST              Time spent on long distance calls per month
International         Time spent on international calls per month
LOCAL                 Time spent on local calls per month
DROPPED               Number of dropped calls
PAY_MTHD              Payment method of the monthly telephone bill
LocalBillType         Tariff for locally based calls
LongDistanceBillType  Tariff for long distance calls
AGE                   Age
SEX                   Gender
STATUS                Marital status
CHILDREN              Number of children
Est_Income            Estimated income
Car_Owner             Car owner
CHURNED               (3 categories) Current – still with the company; Vol – leavers the company wants to keep; Invol – leavers the company doesn’t want

In this session we will explore the various training methods and options for the rule induction techniques within Clementine.

1. Begin a new stream with a Var. File node connected to the file Churn.txt.

2. Use C5.0 and at least one other decision tree method to predict CHURNED and compare the accuracy of both. What do you learn from this? Which rule method performs “best”?


3. Now browse the rules that have been generated by the methods. Which model appears to be the most manageable and/or practical? Do you think there is a trade-off between accuracy and manageability?

4. Does the Balance node have much of an effect with the rule induction techniques?

5. Try switching from Accuracy to Generality in C5.0. Does this have much effect on the size and accuracy of the tree?

6. Experiment with the expert options within the methods you selected to see how they affect tree growth. Can you increase the accuracy without making the model overly complicated? Experiment with the minimum values for parent and child nodes and see how this influences the size of the tree.

7. In C5.0, use the Winnow attributes expert option and see if it reduces the number of inputs used in the model (hint: for an easier comparison, generate a Filter node from the generated C5.0 node with Winnow attributes checked and with it unchecked).

8. Of all the models you have run, which do you think is the “best”? Why?

9. For those with extra time: Use C5.0 or other decision tree methods to predict Response to campaign from the charity.sav data used in the exercises for Chapter 2. How do the rule induction models compare with the neural network models built in the last chapter? Which are the most accurate? Which are the easiest to understand?

You may wish to save the stream (use the name Exer3.str) that you have just created.


Chapter 4: Linear Regression

Objectives

• Review the concepts of linear regression
• Use the Regression node to model medical insurance claims data

4.1 Introduction

Linear regression is a method familiar to just about everyone these days. It is the classic general linear model (GLM) technique, and it is used to predict an outcome variable that is interval or ratio in scale from predictors that are also interval or ratio. In addition, categorical input fields can be included by creating dummy variables (fields). A Regression model node that performs linear regression is available in Clementine.

Linear regression assumes that the data can be modeled with a linear relationship. To illustrate, the figure below contains a scatterplot depicting the relationship between the length of stay for hospital patients and the dollar amount claimed for insurance. Superimposed on the plot is the best-fit regression line. The plot may look a bit unusual in that there are only a few values for length of stay, which is recorded in whole days, and few patients stayed more than three days.

Figure 4.1 Scatterplot of Hospital Length of Stay and Insurance Claim Amount

Although there is a lot of spread around the regression line and a few outliers, it is clear that there is a positive trend in the data such that longer stays are associated with greater insurance claims. Of course, linear regression is normally used with several predictors; this makes it impossible to display the complete solution with all predictors in convenient graphical form. Thus, most users of linear regression use the numeric output.

4.2 Basic Concepts of Regression

In the plot above, to the eye (as well as to one’s economic sense) there seems to be a positive relation between length of stay and the amount of a health insurance claim. However, it would be more useful in practice to have some form of prediction equation. Specifically, if some simple function can approximate the pattern shown in the plot, then the equation for the function would concisely describe the relation and could be used to predict values of one variable given knowledge of the other. A straight line is a very simple function and is usually what researchers start with, unless there are reasons (theory, previous findings, or a poor linear fit) to suggest another. Also, since the goal of much research involves prediction, a prediction equation is valuable. However, the value of the equation is linked to how well it actually describes or fits the data, and so part of the regression output includes fit measures.

The Regression Equation and Fit Measure

In the plot above, insurance claim amount is placed on the Y (vertical) axis and the length of stay appears along the X (horizontal) axis. If we are interested in insurance claim as a function of the length of stay, we consider insurance claim to be the output field and length of stay to be the input or predictor field. A straight line is superimposed on the scatterplot along with the general form of the equation:

Yi = A + B * Xi + ei

Here, B is the slope (the change in Y per one-unit change in X), A is the intercept (the value of Y when X is zero), and ei is the model residual or error for the ith observation. Given this, how would one go about finding a best-fitting straight line? In principle, there are various criteria that might be used: minimizing the mean deviation, mean absolute deviation, or median deviation. Due to technical considerations, and with a dose of tradition, the best-fitting straight line is taken to be the one that minimizes the sum of the squared deviations of each point about the line.

Returning to the plot of insurance claim amount and length of stay, we might wish to quantify the extent to which the straight line fits the data. The fit measure most often used, the r-square measure, has the dual advantages of being measured on a standardized scale and having a practical interpretation. The r-square measure (which is the correlation squared, or r², when there is a single input field, and thus its name) is on a scale from 0 (no linear association) to 1 (perfect prediction). Also, the r-square value can be interpreted as the proportion of variation in one field that can be predicted from the other. Thus an r-square of .50 indicates that we can account for 50% of the variation in one field if we know values of the other. You can think of this value as a measure of the improvement in your ability to predict one field from the other (or others, if there is more than one input field).

Multiple regression represents a direct extension of simple regression. Instead of a single input field (Yi = A + B * Xi + ei), multiple regression allows for more than one input field

Yi = A + B1 * X1i + B2 * X2i + B3 * X3i + … + ei

in the prediction equation. While we are limited in the number of dimensions we can view in a single plot, the regression equation allows for many input fields. When we run multiple regression we will again be concerned with how well the equation fits the data, whether there are any significant linear relations, and estimating the coefficients for the best-fitting prediction equation. In addition, we are interested in the relative importance of the independent fields in predicting the output field.
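The following sketch fits such an equation by least squares and computes r-square. The data are synthetic; the field names (LOS, ASG, AGE, CLAIM) are borrowed from the example developed later in this chapter, and the coefficients and noise level are invented.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
los = rng.integers(1, 7, n).astype(float)        # hypothetical length of stay (days)
asg = rng.integers(0, 3, n).astype(float)        # hypothetical severity group
age = rng.normal(45, 12, n)
claim = 1500 + 1400 * los + 600 * asg + 10 * age + rng.normal(0, 700, n)

X = np.column_stack([np.ones(n), los, asg, age])  # add a constant column for the intercept A
coef, *_ = np.linalg.lstsq(X, claim, rcond=None)  # least-squares estimates of A, B1, B2, B3
pred = X @ coef
resid = claim - pred
r_square = 1 - resid.var() / claim.var()          # proportion of variation explained
print(np.round(coef, 1), round(r_square, 3))
```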

Residuals and Outliers

Viewing the plot, we see that many points fall near the line, but some are more distant from it. For each point, the difference between the value of the dependent field and the value predicted by the equation (the value on the line) is called the residual (ei). Points above the line have positive residuals (they were under-predicted), those below the line have negative residuals (they were over-predicted), and a point falling on the line has a residual of zero (perfect prediction). Points having relatively large residuals are of interest because they represent instances where the prediction line did poorly. As we will see shortly in our detailed example, large residuals (gross deviations from the model) have been used to identify data errors or possible instances of fraud (in application areas such as insurance claims, invoice submission, or telephone and credit card usage).

Assumptions

Regression is usually performed on data for which the input and output fields are interval scale. In addition, when statistical significance tests are performed, it is assumed that the deviations of points around the line (the residuals) follow the normal bell-shaped curve. Also, the residuals are assumed to be independent of the predicted values (the values on the line), which implies that the variation of the residuals around the line is homogeneous (homogeneity of variance). Clementine can provide summaries and plots useful in evaluating these latter issues.

One special case of the assumptions involves the interval scale nature of the independent variable(s). A field coded as a dichotomy (say 0 and 1) can technically be considered an interval scale. An interval scale assumes that a one-unit change has the same meaning throughout the range of the scale. If a field’s only possible codes are 0 and 1 (or 1 and 2, etc.), then a one-unit change does mean the same change throughout the scale. Thus dichotomous or flag fields (for example, gender) can be used as predictor variables in regression. This also permits the use of categorical predictor fields if they are converted into a series of flag fields, each coded 0 or 1; this technique is called dummy coding. Note that the Regression node in Clementine will only accept numeric inputs or ordered sets that contain numeric values (which will then be treated as numeric). Thus if you have symbolic inputs, you must convert them to numeric dummy fields (using the SetToFlag node to create 0/1 dummy-coded fields, followed by a Type node to set the type of these fields to range) before they can be used by the Regression node.
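In Clementine the conversion is done with a SetToFlag node followed by a Type node, as just described; the pandas sketch below simply illustrates what dummy coding produces for a hypothetical three-category field, keeping g – 1 non-redundant dummies.

```python
import pandas as pd

df = pd.DataFrame({"PAY_MTHD": ["CC", "CH", "Auto", "CC", "CH"],
                   "CLAIM":    [4200, 3900, 5100, 4800, 4400]})

# g categories -> g - 1 non-redundant 0/1 dummy fields (drop_first drops the reference category)
dummies = pd.get_dummies(df["PAY_MTHD"], prefix="PAY", drop_first=True, dtype=int)
model_data = pd.concat([df.drop(columns="PAY_MTHD"), dummies], axis=1)
print(model_data)
```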

4.3 An Example: Error or Fraud Detection in Claims

To illustrate linear regression we turn to a data set containing insurance claims (CLAIM) for a single medical treatment performed in a hospital (in the US, a single DRG or diagnostic related group). In addition to the claim amount, the data file also contains patient age (AGE), length of hospital stay (LOS), and a severity of illness category (ASG). This last field is based on several health measures, and higher scores indicate greater severity of the illness. The plan is to build a regression model that predicts the total claim amount for a patient on the basis of length of stay, severity of illness, and patient age. Assuming the model fits, we are then interested in those patients that the model predicts poorly. Such cases can simply be instances of poor model fit, or the result of predictors not included in the model, but they might also be due to errors on the claims form or fraudulent entries. Thus we are approaching the problem of fraud detection by identifying exceptions to the prediction model. Such exceptions are not necessarily instances of fraud, but since they are inconsistent with the model, they may be more likely to be fraudulent or contain errors.

Some organizations perform random audits on claims applications and then classify them as fraudulent or not. Under these circumstances, predictive models can be constructed that attempt to correctly classify new claims; logistic regression, discriminant analysis, rule induction, and neural networks have been used for this purpose. However, when such an outcome field is not available, fraud detection involves searching for and identifying exceptional instances. Here, an exceptional instance is one that the model predicts poorly. We are using regression to build the model; if there were reason to believe the model were more complex (for example, contained nonlinear relations or complex interactions), then a neural network or rule induction model could be applied. We begin by opening the stream.

Click File…Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on LinearRegress.str
Double-click on the Type node

Figure 4.2 Type Node for Claims Data

We will develop a regression equation predicting claim amount based on hospital length of stay, severity of illness group, and age. Note that the severity of illness field (ASG) is of type range, although it has only the three integer values 0, 1, and 2. We will leave it as range since these values fall on an ordered scale (higher values indicate greater severity). If we wished to treat it as a symbolic field, we would create dummy (0,1) fields using the SetToFlag node and use all but one of the dummy fields (for g groups, there are g – 1 non-redundant dummy fields), declared as type range, as inputs to the Regression node.

Close the Type node dialog
Double-click on the Regression node (named CLAIM)

Figure 4.3 Linear Regression Model Dialog

Simple options include whether a constant (intercept) will be used in the equation and the Method of input field selection. By default (Enter), all inputs will be included in the linear regression equation. With such a small number of predictor fields, we will simply add them all into the model together. However, in the common situation of many input fields (most insurance claim forms would contain far more information), a mechanism to select the most promising predictors is desirable. This could be based on the domain knowledge of the business expert (here, perhaps a health administrator). In addition, an option may be chosen to select, from a larger set of independent variables, those that in some statistical sense are the best predictors (the Stepwise method). In the stepwise method, the best input field (according to a statistical criterion) is entered into the prediction equation. Then the next best input field is entered, and so on, until a point is reached when no further input fields meet the criterion. The stepwise method includes a check to ensure that the fields entered into the equation before the current step still meet the statistical criterion when the additional inputs are added. Variations on the stepwise method are available as well: Forward, in which inputs are added one by one, as described above, but are never removed; and Backward, in which all inputs are entered, then the least significant input is removed, and this process is repeated until only statistically significant inputs remain.

Click the Fields tab

Figure 4.4 Regression Fields Tab

The weighted least squares option (the Use weight field check box) supports a form of regression in which the variability of the output field differs for different values of the input fields; an adjustment can be made for this if an input field is related to this degree of variation. In practice, this option is rarely used. We also see here the option to specify a partition field, used when there is such a field but it doesn’t have the default name of Partition.

Click the Expert tab
Click the Expert Mode option button

Figure 4.5 Expert Options (Missing Values and Tolerance)

By default, the Regression node will only use records with valid values on the input and output fields (this is often called listwise deletion). This option can be checked off, in which case Clementine will attempt to use as much information as possible to estimate the Regression model, including records where some of the fields have missing values. It does this through a method called pairwise deletion of missing values. However, we recommend against using this option unless you are a very experienced user of regression; using incomplete records in this manner can lead to computational problems in estimating the regression equation. Instead, if there is a large amount of missing data, you may wish to substitute valid values for the missing data before using the Regression node. The Singularity tolerance will not allow an input field in the model unless at least .0001 (.01 %) of its variation is independent of the other predictors. This prevents the linear regression model estimation from failing due to multicollinearity (linear redundancy in the predictors). Most analysts would recommend increasing the default tolerance value to at least .05, though.
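For reference, listwise deletion simply drops any record with a missing value on any field used by the model. A small pandas sketch (the field names match the claims data; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"LOS":   [3,      5,      None],
                   "ASG":   [1,      None,   2],
                   "AGE":   [52,     40,     61],
                   "CLAIM": [5200.0, 7900.0, 6100.0]})

# Keep only records that are complete on every model field (listwise deletion)
complete = df.dropna(subset=["LOS", "ASG", "AGE", "CLAIM"])
print(len(df), len(complete))   # 3 records in, 1 complete record out
```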

Click the Model tab, and then click Stepwise on the Method drop-down list
Click the Expert tab, and then click the Stepping button

Figure 4.6 Stepping Criteria and Tolerance Expert Options

You control the criteria used for input field entry and removal from the model. By default, an input field must be statistically significant at the .05 level for entry and will be dropped from the model if its significance value increases above .1.

Click Cancel
Click Output button

Figure 4.7 Advanced Output Options

These options control how much supplementary information concerning the regression analysis displays. The results will appear in the Advanced tab of the generated model node in HTML format. Confidence bands (95%) for the estimated regression coefficients can be requested (Confidence interval). Summaries concerning relationships among the inputs can be obtained by requesting their Covariance matrix or Collinearity Diagnostics. The latter are especially useful when you need to identify the source and assess the level of redundancy in the predictors. Part and partial correlations measure the relationship between an input and the output field, controlling for the other inputs. Descriptive statistics (Descriptives) include means, standard deviations, and correlations; these summaries can also be obtained from the Statistics or Data Audit node. The Durbin-Watson statistic can be used when running regression on time series data and evaluates the degree to which adjacent residuals are correlated (regression assumes residuals are uncorrelated).

Click Cancel
Click the Simple option button
Click Model tab, then click Enter on the Method drop-down list
Click Execute button
Add the Regression generated model node (named CLAIM) to the Stream canvas
Connect the Type node to the Regression generated model node
Edit the Regression generated model node
Click the Summary tab, and then expand the Analysis summary

Figure 4.8 Linear Regression Browser Window (Analysis Summary)

This Analysis summary contains only the equation relating the predictor fields to the output. We could interpret the coefficients here, but since we don’t know whether they are statistically significant or not, we will postpone this until we examine additional information in the Advanced tab. To reach the more detailed results:

Click the Advanced tab
Increase the size of the window to see more of the output

The advanced output is formatted in HTML. After listing the dependent (output) and independent (input) fields, Regression provides several measures of how well the model fits the data. First is the multiple R, which is a generalization of the correlation coefficient. If there are several input fields (our situation) then the multiple R represents the unsigned (positive) correlation between the output and the optimal linear combination of the input fields. Thus the closer the multiple R is to 1, the better the fit. As mentioned earlier, the r-square measure can be interpreted as the proportion of variance of the output that can be predicted from the input field(s). Here it is about 32% (.318), which is far from perfect prediction, but still substantial. The adjusted r-square represents a technical improvement over the r-square in that it explicitly adjusts for the number of input fields and sample size, and as such is preferred by many analysts. Generally the two r-square values are very close in value; in fact, if they differ dramatically in multiple regression, it is a sign that you have used too many inputs relative to your sample size, and the adjusted r-square value should be more trusted. In our results, they are very close.
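The adjustment is easy to reproduce by hand. A quick check, assuming roughly 300 records (per the data description) and the three inputs used here:

```python
r2 = 0.318          # multiple R-square reported for the claims model
n, p = 300, 3       # approximate sample size and number of input fields
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))   # about 0.311 -- close to the unadjusted value, as expected
```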

Figure 4.9 Model Summary and Overall Significance Tests

While the fit measures indicate how well we can expect to predict the output or how well the line fits the data, they do not tell whether there is a statistically significant relationship between the output and input fields. The analysis of variance table presents technical summaries (sums of squares and mean square statistics), but here we refer to variation accounted for by the prediction equation. We are interested in determining whether there is a statistically significant (non-zero) linear relation between the output and the input field(s) in the population. Since our analysis contains three input fields, we test whether any linear relation differs from zero in the population from which the sample was taken. The significance value accompanying the F test gives us the probability that we could obtain one or more sample slope coefficients (which measure the straight-line relationships) as far from zero as what we obtained, if there were no linear relations in the population. The result is highly significant (significance probability less than .0005—the table value is rounded to .000—or 5 chances in 10,000).

Now that we have established that there is a significant relationship between the claim amount and one or more input fields and obtained fit measures, we turn to interpreting the regression coefficients. Here we are interested in verifying that several expected relationships hold: (1) claims will increase with length of stay, (2) claims will increase with increasing severity of illness, and (3) claims will increase with age. Strictly speaking, this step is not necessary in order to identify cases that are exceptional. However, in order to be confident in the model, it should make sense to a domain expert in hospital claims. Since interpretation of regression models can be made directly from the regression coefficients, we turn to those next.

Figure 4.10 Regression Coefficients

The second column contains a list of the input fields plus the intercept (Constant). The estimated coefficients in the B column are those we saw when we originally browsed the Linear Regression generated model node; they are now accompanied by supporting statistical summaries. Although the B coefficient estimates are important for prediction and interpretive purposes, analysts usually look first to the t test at the end of each row to determine which input fields are significantly related to the output field. Since three inputs are in the equation, we are testing if there is a linear relationship between each input field and the output field after adjusting for the effects of the two other inputs. Looking at the significance values (Sig.) we see that all three predictors are highly significant (significance values are .004 or less). If any of the fields were found to be not significant, you would typically rerun the regression after removing these input field(s).

The column labeled B contains the estimated regression coefficients we would use to deploy the model via a prediction equation. The coefficient for length of stay indicates that on average, each additional day spent in the hospital was associated with a claim increase of about $1,106. The coefficient for admission severity group tells us that each one-unit increase in the severity code is associated with a claim increase of $417. Finally, the age coefficient of about –$33 suggests that claims decrease, on average, by $33 as patient age increases one year. This is counterintuitive and should be examined by a domain expert (here a physician). Perhaps the youngest patients are at greater risk, or perhaps the type of insurance policy, which is linked somehow to age, influences the claim amount. If there isn't a convincing reason for this negative association, the data values for age and claims should be examined more carefully (perhaps data errors or outliers are influencing the results). Such oddities may have shown up in the original data exploration. We will not pursue this issue here, but it certainly would be done in practice.

The constant or intercept of $3,027 represents the predicted claim amount for someone with 0 days in the hospital, in the least severe illness category (0), and with age 0. This is clearly impossible. This odd result stems in part from the fact that no one in the sample had less than 1 day in the hospital (it was an inpatient procedure) and the patients were adults (no ages of 0), so the intercept projects well beyond where there are any data. Thus the intercept cannot represent an actual patient, but it is still needed to fit the data. Also, note that when using regression it can be risky to extrapolate beyond where the data are observed, since the assumption is that the same pattern continues. Here it clearly cannot!

The Std. Error (of B) column contains standard errors of the estimated regression coefficients. These provide a measure of the precision with which we estimate the B coefficients. The standard errors can be used to create a 95% confidence band around the B coefficients (available as an Expert Output option). In our example, the regression coefficient for length of stay is $1,106 and the standard error is about $104. Thus we would not be surprised if in the population the true regression coefficient were $1,000 or $1,200 (within two standard errors of our sample estimate), but it is very unlikely that the true population coefficient would be $300 or $2,000.

Betas are standardized regression coefficients and are used to judge the relative importance of each of several input fields. They are important because the values of the regression coefficients (Bs) are influenced by the standard deviations of the input fields and their scale, and the beta coefficients adjust for this. Here, not surprisingly, length of stay is the most important predictor of claim amount, followed by severity group and age. Betas typically range from –1 to 1 and the further from 0, the more influential the predictor variable. Thus if we wish to predict claims based on length of stay, severity code and age, the formula would use the estimated B coefficients:

Predicted Claims = $3,027 + $1,106 * length of stay + $417 * severity code – $33 * age
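Deployment of this equation amounts to a single arithmetic expression. A minimal sketch (coefficients rounded to whole dollars as shown above; the example patient is hypothetical):

```python
def predicted_claim(los_days, severity, age):
    """Score one patient with the fitted linear regression equation."""
    return 3027 + 1106 * los_days + 417 * severity - 33 * age

print(predicted_claim(los_days=3, severity=1, age=50))   # 5112
```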

Points Poorly Fit by Model
The motivation for this analysis is to detect errors or possible fraud by identifying cases that deviate substantially from the model. As mentioned earlier, these need not be the result of errors or fraud, but they are inconsistent with the majority of cases and thus merit scrutiny. We first create a field that stores the residuals, or errors in prediction, which we will then sort and display in a table.

Close the Regression generated model node
Place a Derive node from the Field Ops palette to the right of the Regression generated model node
Connect the Regression generated model node to the Derive node
Edit the Derive node
Enter the new field name DIFF into the Derive field: text box
Enter the formula CLAIM – ‘$E-CLAIM’ into the Formula text box

Figure 4.11 Computing an Error (Residual) Field

The DIFF field measures the difference between the actual claim value (CLAIM) and the claim value predicted by the model ($E-CLAIM). Since we are most interested in the large positive errors, we will sort the data by DIFF before displaying it in a table.
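Outside Clementine, the same Derive/Sort/Table sequence can be sketched in a few lines of pandas; the column name E_CLAIM stands in for Clementine's $E-CLAIM prediction field and the values are invented:

```python
import pandas as pd

df = pd.DataFrame({"CLAIM":   [9100.0, 4200.0, 7600.0],
                   "E_CLAIM": [5100.0, 4400.0, 7500.0]})

df["DIFF"] = df["CLAIM"] - df["E_CLAIM"]                # residual: actual minus predicted
suspicious = df.sort_values("DIFF", ascending=False)    # largest under-predictions first
print(suspicious)
```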

Click OK to complete the Derive node
Place a Sort node to the right of the Derive node
Connect the Derive node to the Sort node
Edit the Sort node
Select DIFF as the Sort by field
Select Descending in the Order column
Click OK to process the Sort request
Place a Table node to the right of the Sort node
Connect the Sort node to the Table node
Execute the Table node

Figure 4.12 Errors Sorted in Descending Order

There are two records for which the claim values are much higher than the regression prediction. Both are about $6,000 more than expected from the model. These would be the first claims to examine more carefully. We could also examine the last few records for large over-predictions, which might be errors as well.


Summary Exercises

A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.

Exercise Data Information Sheets
The exercises in this chapter are written around the data file InsClaim.dat. The following section gives details of the file.

InsClaim.dat contains insurance claim information from patients in a hospital. All patients were in the same diagnostic related group (DRG) category. A few fields containing patient information are included. Interest is in building a prediction model of total charges based on patient information and then identifying exceptions to the model (error or fraud detection). The file contains about 300 records and the following fields:

ASG     Severity of illness code (higher values mean more seriously ill)
AGE     Age
LOS     Length of hospital stay (in days)
CLAIM   Total charges in US dollars (total amount claimed on form)

1. Using the insurance claims data, use the Stepwise method and compare the equation to the one obtained using the Enter method. Are you surprised by the result? Why or why not? Try the Forward and Backward methods. Do you find any differences?

2. Instead of examining errors in the original scale, analysts may prefer to express the residual as a percent deviation from the prediction. Such a measure may be easier to communicate to a wider audience. Add a Derive node that calculates a percent error. Name this field PERERROR and use the following formula: 100* (CLAIM – '$E-CLAIM')/'$E-CLAIM'. Compare this measure of error to the original DIFF. Do the same records stand out? What conditions is this percent error most sensitive to? Use the Histogram node to produce histograms for either of the error fields, generate a Select node to select records with large errors, and then display them in a table.
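If you want to try the same derivation outside Clementine, a pandas one-liner mirrors the formula in the exercise (E_CLAIM again stands in for $E-CLAIM, and the sample values are invented):

```python
import pandas as pd

df = pd.DataFrame({"CLAIM":   [9100.0, 4200.0],
                   "E_CLAIM": [5100.0, 4400.0]})

df["PERERROR"] = 100 * (df["CLAIM"] - df["E_CLAIM"]) / df["E_CLAIM"]
print(df)   # first record about +78.4%, second about -4.5%
```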

3. Use the Neural Net modeling node to predict CLAIM using a neural network. How does its performance compare to linear regression? What does this suggest about the model? Fit a C&R Tree model and make the same comparison. Examine the errors from the better of these latter models (as you judge them). Do the same records consistently display large errors?


Chapter 5: Logistic Regression

Objectives
• Review the concepts of logistic regression
• Use the technique to model credit risk

Data
A risk assessment study in which customers with credit cards were assigned to one of three categories: good risk, bad risk-profitable (some payments missed or other problems, but were profitable for the issuing company), and bad risk-loss. In addition to the risk classification field, a number of demographics, including age, income, number of children, number of credit cards, number of store credit cards, having a mortgage, and marital status, are available for about 2,500 records.

5.1 Introduction to Logistic Regression
Logistic regression, unlike linear regression, develops a prediction equation for a symbolic or set output field that contains two or more unordered categories (the categories could be ordered, but logistic regression does not take the ordering into account). Thus it can be applied to such situations as:

• Predicting which brand (of the major brands) of personal computer an individual will purchase

• Predicting whether or not a customer will close her account, accept an offering, or switch providers

Logistic regression technically predicts the probability of an event (of a record being classified into a specific category of the outcome field). The logistic function is shown in Figure 5.1. Suppose that we wish to predict whether someone buys a product. The function displays the predicted probability of purchase based on an incentive.

Figure 5.1 Logistic Model for Probability of Purchase

We see the probability of making the purchase increases as the incentive increases. Note that the function is not linear but rather S-shaped. The implication of this is that a slight change in the incentive could be effective or not depending on the location of the starting point. A linear model would imply that a fixed change in incentive would always have the same effect on probability of purchase. The transition from low to high probability of purchase is quite gradual. However, with a logistic model the transition can occur much more rapidly (steeper slope) near the .5 probability value. To understand how the model functions, we need to review some equations. The logistic model makes predictions based on the probability of an outcome. Binary (two outcome category) logistic regression can be formulated as:

Prob(event) = e^(α + B1X1 + B2X2 + ... + BkXk) / (1 + e^(α + B1X1 + B2X2 + ... + BkXk))

Where X1, X2, …, Xk are the input fields. This can also be expressed in terms of the odds of the event occurring.

Odds(event) = Prob(event) / (1 - Prob(event)) = Prob(event) / Prob(no event) = e^(α + B1X1 + B2X2 + ... + BkXk)

where the outcome is one of two categories (event, no event). If we take the natural log of the odds, we have a linear model, akin to a standard regression equation:

ln(Odds(event)) = α + B1X1 + B2X2 + ... + BkXk

With two output categories, a single odds ratio summarizes the outcome. However, when there are more than two output categories, ratios of the category probabilities can still describe the outcome, but additional ratios are required. For example, in the credit risk data used in this chapter there are three outcome categories: good risk, bad risk–profit, and bad risk–loss. Suppose we take the Good Risk category as the reference or baseline category and assign integer codes to the outcome categories for identification: (1) Bad Risk–Profit, (2) Bad Risk–Loss, (3) Good Risk. For the three categories we can create two probability ratios:

g(1) = π(1)/π(3) = Prob(Bad Risk–Profit) / Prob(Good Risk)

and

g(2) = π(2)/π(3) = Prob(Bad Risk–Loss) / Prob(Good Risk)

where π(j) is the probability of being in outcome category j. Each ratio is based on the probability of an output category divided by the probability of the reference or baseline outcome category. The remaining probability ratio (Bad Risk–Profit / Bad Risk–Loss) can be obtained by taking the ratio of the two ratios shown above. Thus the information in J outcome categories can be summarized in (J-1) probability ratios. In addition, these outcome-category probability ratios can be related to input fields in a fashion similar to what we saw in the binary logistic model. Again using the Good Risk output category as the reference or baseline category, we have the following model:

ln(π(1)/π(3)) = ln( Prob(Bad Risk–Profit) / Prob(Good Risk) ) = α1 + B11X1 + B12X2 + ... + B1kXk

and

ln(π(2)/π(3)) = ln( Prob(Bad Risk–Loss) / Prob(Good Risk) ) = α2 + B21X1 + B22X2 + ... + B2kXk

Notice that there are two sets of coefficients for the three-category output case, each describing the ratio of an output category to the reference or baseline category. If we complete this logic and create a ratio containing the baseline category in the numerator, we would have:

ln(π(3)/π(3)) = ln( Prob(Good Risk) / Prob(Good Risk) ) = ln(1) = 0 = α3 + B31X1 + B32X2 + ... + B3kXk

This implies that the coefficients associated with ln(π(3)/π(3)) are all 0 and so are not of interest.

Also, the ratio relating any two output categories, excluding the baseline, can be easily obtained by subtracting their respective natural log expressions. Thus:

ln(π(1)/π(2)) = ln(π(1)/π(3)) - ln(π(2)/π(3)), or

ln( Prob(Bad Risk–Profit) / Prob(Bad Risk–Loss) ) = ln( Prob(Bad Risk–Profit) / Prob(Good Risk) ) - ln( Prob(Bad Risk–Loss) / Prob(Good Risk) )

We are interested in predicting the probability of each output category for specific values of the predictor variables. This can be derived from the expressions above. The probability of being in outcome category j is:

π(j) = g(j) / Σ g(i)  (sum over i = 1 to J), where J is the number of output categories.

In our example with the three risk output categories, for outcome category (1):

π(1) = g(1) / ( g(1) + g(2) + g(3) ) = ( π(1)/π(3) ) / ( π(1)/π(3) + π(2)/π(3) + π(3)/π(3) ) = π(1) / ( π(1) + π(2) + π(3) ) = π(1)

And substituting for the g(j)’s, we have an equation relating the predictor variables to the output category probabilities.

π(1) = e^(α1 + B11X1 + ... + B1kXk) / ( e^(α1 + B11X1 + ... + B1kXk) + e^(α2 + B21X1 + ... + B2kXk) + e^(α3 + B31X1 + ... + B3kXk) )

or, since the coefficients for the baseline category are all 0 (so its exponential term equals 1),

π(1) = e^(α1 + B11X1 + ... + B1kXk) / ( 1 + e^(α1 + B11X1 + ... + B1kXk) + e^(α2 + B21X1 + ... + B2kXk) )

In this way, the logic of binary logistic regression can be naturally extended to permit analysis of symbolic output fields with more than two categories.
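The whole calculation can be expressed compactly. Below is a minimal Python sketch of these formulas (it is not Clementine code); coefs holds one (alpha, [B1, ..., Bk]) pair per non-baseline category:

```python
import math

def category_probabilities(x, coefs):
    """Return pi(1), ..., pi(J) for one record.

    x     -- list of input field values [X1, ..., Xk]
    coefs -- list of (alpha, [B1, ..., Bk]) for the J-1 non-baseline categories.
    The baseline category has all coefficients fixed at 0, so its g value is exp(0) = 1."""
    g = [math.exp(alpha + sum(b * xi for b, xi in zip(betas, x)))
         for alpha, betas in coefs]
    g.append(1.0)                       # baseline category
    total = sum(g)
    return [gj / total for gj in g]     # pi(j) = g(j) / sum over i of g(i)
```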

5.2 A Multinomial Logistic Analysis: Predicting Credit Risk
We will perform a multinomial logistic analysis that attempts to predict credit risk (three categories) for individuals based on several financial and demographic input fields. We are interested in fitting a model, interpreting and assessing it, and obtaining a prediction equation. Possible input fields are shown below.


Field name   Field description
AGE          age in years
INCOME       income (in British pounds)
GENDER       f=female, m=male
MARITAL      marital status: single, married, divsepwid (divorced, separated or widowed)
NUMKIDS      number of dependent children
NUMCARDS     number of credit cards
HOWPAID      how often paid: weekly, monthly
MORTGAGE     have a mortgage: y=yes, n=no
STORECAR     number of store credit cards
LOANS        number of other loans
INCOME1K     income divided by 1,000 (i.e., in thousands of British pounds)

The output field is:

Field name   Field description
RISK         credit risk: 1 = bad risk-loss, 2 = bad risk-profit, 3 = good risk

To access the data:

Click File…Open Stream and move to the c:\Train\ClemAdvMod directory
Double-click on Logistic.str
Execute the Table node, examine the data, and then close the Table window
Double-click on the Type node

The output field is credit risk (RISK). Notice that only four input fields are used. This is done to simplify the results for this presentation. As an exercise, the other fields will be used as predictors.

Figure 5.2 Type Node for Logistic Analysis

Close the Type node dialog
Double-click on the Logistic Regression model node named RISK

Figure 5.3 Logistic Regression Dialog

In the Model tab, you can choose whether a constant (intercept) is included in the equation. The Procedure option is used to specify whether a binomial or multinomial model is created. The options that will be available to you in the dialog box will differ according to which modeling procedure you select.

Binomial is used when the target field is a flag or set with two discrete values, such as good risk/bad risk, or churn/not churn. Whenever you use this option, you will in addition be asked to declare which of your flag or set fields should be treated as categorical, the type of contrast you want performed, and the reference category for each predictor. The default contrast is Indicator, which indicates the presence or absence of category membership. However, in fields with some implicit order, you may want to use another contrast such as Repeated, which compares each category with the one that precedes it. The default reference or base category is the First category. If you prefer, you can change this to the Last category.

Multinomial should be used when the target field is a set field with more than two values. This is the correct choice in our example because the RISK field has three values: bad loss, bad profit, and good risk. Whenever you use this option, the Model type option will become available for you to specify whether you want a main effects model, a full factorial model, or a custom model. By default, a model including the main effects (no interactions) of factors (symbolic inputs) and covariates (numeric inputs) will be run. This is similar to what the Regression model node will do (unless interaction terms are formally added). The Full factorial option would fit a model including all factor interactions (in our example, with two symbolic predictors, the two-way interaction of MARITAL and MORTGAGE would be added).

Notice that there are Method options (as there were for linear regression), so stepwise methods can be used when the Main Effects model type is selected. When a number of input fields are available, the stepwise methods provide a method of input field selection based on statistical criteria.

The Base Category for target option is used to specify the reference category. The default is the First category in the list, which in this case is bad loss. Note: This field is unavailable if the contrast setting is Difference, Helmert, Repeated, or Polynomial.

Select the Multinomial Procedure option (if necessary)
Click on the Specify button to the right of Base category for target; this will open the Insert Value dialog box
Click on good risk

Figure 5.4 Insert Value Dialog

Click the Insert button

This will change the base target category. The result is shown in Figure 5.5.

Figure 5.5 Logistic Regression Dialog with Good Risk as the Base Target Category

Click on the Expert tab
Click the Expert Mode option button

Figure 5.6 Logistic Expert Mode Options

The Scale option allows adjustment to the estimated parameter variance-covariance matrix based on over-dispersion (variation in the outcome greater than expected by theory, which might be due to clustering in the data). The details of such adjustment are beyond the scope of this course, but you can find some discussion in McCullagh and Nelder (1989). If the Append all probabilities checkbox is selected, predicted probabilities for every category of the output field would be added to each record passed through the generated model node. If not selected, a predicted probability field is added only for the predicted category.

Click the Output button
Click the Likelihood ratio tests check box
Click the Classification table check box

By default, summary statistics and (partial) likelihood ratio tests for each effect in the model appear in the output. Also, 95% confidence bands will be calculated for the parameter estimates. We have requested a classification table so we can assess how well the model predicts the three risk categories.

Figure 5.7 Logistic Regression Advanced Output Options

In addition, a table of observed and expected cell probabilities can be requested (Goodness of fit chi-square statistics). Note that, by default, cells are defined by each unique combination of a covariate (range input) and factor (symbolic input) pattern, and a response category. Since a continuous predictor (INCOME1K) is used in our analysis, the number of cell patterns is very large and each might have but a single observation. These small counts could possibly yield unstable results, and so we will forego goodness of fit statistics. The asymptotic correlation of parameter estimates can provide a warning for multicollinearity problems (when high correlations are found among parameter estimates). Iteration history information is requested to help debug problems if the algorithm fails to converge, and the number of iteration steps to display can be specified. Monotonicity measures can be used to find the number of concordant pairs, discordant pairs, and tied pairs in the data, as well as the percentage of the total number of pairs that each represents. The Somers' D, Goodman and Kruskal's Gamma, Kendall's tau-a, and Concordance Index C are also displayed in this table. Information criteria shows Akaike’s information criterion (AIC) and Schwarz’s Bayesian information criterion (BIC).

Click OK
Click Convergence button

Figure 5.8 Logistic Regression Convergence Criteria

The Logistic Regression Convergence Criteria options control technical convergence criteria. Analysts familiar with logistic regression algorithms might use these if the initial analysis fails to converge to a solution.

Click Cancel
Click Execute
Browse the Logistic Regression generated model node named RISK in the Models Manager window
Click the Advanced tab, and then expand the browsing window

The advanced output is displayed in HTML format.

Figure 5.9 Record Processing Summary

The marginal frequencies of the symbolic inputs and the output are reported, along with a summary of the number of valid and missing records. A record must have valid values on all inputs and the output in order to be included in the analysis. We have nearly 2,500 records for the analysis.

Figure 5.10 Model Fit and Pseudo R-Square Summaries

The Final model chi-square statistic tests the null hypothesis that all model coefficients are zero in the population, equivalent to the overall F test in regression. It has ten degrees of freedom that correspond to the parameters in the model (seen below), is based on the change in –2LL (–2 log likelihood) from the initial model (with just the intercept) to the final model, and is highly significant. Thus at least some effect in the model is significant.

The AIC and BIC fit measures are also displayed. Smaller values of these criteria indicate a better balance of fit and model complexity; because both decreased when the predictors were added, we can conclude that the model fit improved.

Pseudo r-square measures try to measure the amount of variation (as functions of the chi-square lack of fit) accounted for by the model. The model explains only a modest amount of the variation (the maximum is 1, and some measures cannot reach this value).

Figure 5.11 Likelihood Ratio Tests

The Model Fitting Criteria table provided an omnibus test of effects in the model. Here we have a test of significance for each effect (in this case the main effect of an input field) after adjusting for the other effects in the model. The caption explains how it is calculated. All effects are highly significant. Notice that the intercepts are not tested in this way, but tests of the individual intercepts can be found in the Parameter Estimates table.

In addition, we can use this table to rank order the importance of the predictors. For instance, if we focus on the –2 LL value, if INCOME1K were removed as a predictor, the –2 LL value would increase by 302.422. Clearly, the removal of this predictor would have far more impact on the overall fit than if we were to eliminate any of the other predictors. The further –2LL gets from zero, the worse the fit. Thus, we can conclude that INCOME1K is the most important predictor, followed by MARITAL, NUMKIDS, and MORTGAGE.

For those familiar with binary (two output category) logistic regression, note that the values in the df (degrees of freedom) column are double what you would expect for a binary logistic regression model. For example, the covariate income (INCOME1K), which is continuous, has two degrees of freedom. This is because with three outcome categories, there are two probability ratios to be fit, doubling the number of parameters. Income has by far the largest chi-square value compared to the other predictors with two (or even four) degrees of freedom.

5.3 Interpreting Coefficients
The most striking feature of the Parameter Estimates table is that there are two sets of parameters. One set is for the probability ratio of “bad risk–loss” to “good risk,” which is labeled “bad loss.” The other set is for the probability ratio of “bad risk–profit” to “good risk,” labeled “bad profit.” You can view the estimates in equation form in the Model tab, but the Advanced tab contains more supplementary information.

Figure 5.12 Parameter Estimates

For each of the two outcome probability ratios, each predictor is listed, plus an intercept, with the B coefficients and their standard errors, a test of significance based on the Wald statistic, and the Exp(B) column, which is the exponentiated value of the B coefficient, along with its 95% confidence interval. As with ordinary linear regression, these coefficients are interpreted as estimates for the effect of a particular input, controlling for the other inputs in the equation. Recall that the original (linear) model is in terms of the natural log of a probability ratio. The intercept represents the log of the expected probability ratio of two outcome categories when all numeric inputs are zero and all symbolic fields are set to their reference category (last group) values. For covariates, the B coefficient is the effect of a one-unit change in the input on the log of the probability ratio.

Examining income (INCOME1K) in the “bad loss” section, an increase of 1 unit (equivalent to 1,000 British pounds) changes the log of the probability ratio between “bad loss” and “good risk” by –.056, that is, decreases it by .056. But what does this mean in terms of probabilities? Moving to the Exp(B) column, we see the value is .945 for INCOME1K (in the “bad loss” section of the table). Thus increasing income by 1 unit (or 1,000 British pounds) decreased the expected ratio of the probability of being a bad loss to the probability of being a good risk by a factor of .945. In other words, increasing income reduces the expected probability of being a “bad loss” relative to being a “good risk,” and this reduction is a factor of .945 per 1,000 British pounds. This finding makes common sense. If we examine the income coefficient in the “bad profit” section of the table, we see that in a similar way (Exp(B) = .878) the expected probability of being a “bad profit” relative to being a good risk decreases as income increases. Thus increasing income, after controlling for the other variables in the equation, is associated with decreasing the probability of having a “bad loss” or “bad profit” outcome relative to being a “good risk.” This relationship is quantified by the values in the Exp(B) column, and the Sig column indicates that both coefficients are statistically significant.

Turning to the number of children (NUMKIDS), we see that its coefficient is significant for the “bad loss” ratio, but not the “bad profit” ratio. Examining the Exp(B) column for NUMKIDS in the “bad loss” section, the coefficient estimate is 2.267. For each additional child (one unit increase in NUMKIDS), the expected ratio of the probability of being a “bad loss” to being a “good risk” more than doubles. Thus, controlling for other predictors, adding a child (one unit increase) doubles the expected probability of being a “bad loss” relative to a “good risk.” However, controlling for the other predictors, the number of children has no significant effect on the probability ratio of being a “bad profit” relative to a “good risk.”

The Logistic node uses a General Linear Model coding scheme. Thus for each symbolic input (here MARITAL and MORTGAGE), the last category value is made the reference category and the other coefficients for that input are interpreted as offsets from the reference category. In examining the table we see that the last categories for MARITAL (single) and MORTGAGE (y) have B coefficients fixed at 0. Because of this the coefficient of any other category can be interpreted as the change associated with shifting from the reference category to the category of interest, controlling for the other input fields. Since the reference category coefficients are fixed at 0, they have no associated statistical tests or confidence bands.
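The Exp(B) column is simply e raised to the B coefficient, so the two representations can be converted back and forth. A quick check (small differences come from the rounding of the reported coefficients):

```python
import math

print(round(math.exp(-0.056), 3))   # about 0.946 -- the reported Exp(B) of .945 for INCOME1K
print(round(math.exp(0.818), 3))    # about 2.266 -- the reported Exp(B) of 2.267 for NUMKIDS
print(round(math.log(2.891), 3))    # about 1.062 -- recovering B from Exp(B) for MARITAL=married
```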
Looking at the MARITAL input field, its two coefficients (for the divsepwid and married categories) are significant for both the “bad loss” and “bad profit” summaries. In the “bad loss” section, we see the estimated Exp(B) coefficient for the “MARITAL=divsepwid” category is .284, while that for “MARITAL=married” is 2.891. Thus we could say that, after controlling for other inputs, compared to those who are single, those who are divorced, separated or widowed have a large reduction (.284) in the expected ratio of the probability of being a “bad loss” relative to a “good risk.” Put another way, the divorced, separated or widowed group is expected to have fewer “bad losses” relative to “good risks” than is the single group. On the other hand, the married group is expected to have a much higher (by a factor of almost 3) proportion of “bad losses” relative to “good risks” than the single group. The explanation of why being married versus single should be associated with an increase of “bad losses” relative to “good risks” should be worked out by the analyst, perhaps in conjunction with someone familiar with the credit industry (domain expert). If we examine the MARITAL Exp(B) coefficients for the “bad profit” ratios, we find a very similar result.

Finally, MORTGAGE is significant for both the “bad loss” and “bad profit” ratios. Since having a mortgage (coded y) is the reference category, examining the Exp(B) coefficients shows that compared to the group with a mortgage, those without a mortgage have a greater expected probability of being “bad losses” (1.828) or “bad profits” (2.526) relative to “good risks.” In short, those without mortgages are less likely to be good risks, controlling for the other predictors.

In this way, the statistical significance of inputs can be determined and the coefficients interpreted. Note that if a predictor were not significant in the Likelihood Ratio Tests table, then the model should be rerun after dropping the variable. Although NUMKIDS is not significant for both sets of category ratios, the joint test (Likelihood Ratio Test) indicates it is significant and so we would retain it.

Classification Table
The classification table, sometimes called a misclassification table or confusion matrix, provides a measure of how well the model performs. With three output categories we are interested in the overall accuracy of model classification, the accuracy for each of the individual output categories, and patterns in the errors.

Figure 5.13 Classification Table

The rows of the table represent the actual output categories while the columns are the predicted output categories. We see that overall, the predictive accuracy of the model is 62.4%. Although marginal counts do not appear in the table, by adding the counts within each row we find that the most common output category is bad profit (1,475). This constitutes 60.1% of all cases (2,455). Thus the overall predictive accuracy of our model is not much of an improvement over the simple rule of always predicting “bad risk–profit.” However, we should recall that this simple rule would never make a prediction of “bad risk–loss” or “good risk.” In examining the individual output categories, the “bad risk–profit” group is predicted most accurately (87.3%), while the other categories, “bad risk–loss” (15.9%) and “good risk” (36.8%), are predicted with much less accuracy. Not surprisingly, most errors in prediction for these latter two output categories are predicted to be “bad risk–profit.”
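The comparison against the naive rule can be checked directly from the figures quoted above; the individual confusion-matrix cell counts are not reproduced here, so only the marginal counts from the text are used:

```python
total_cases = 2455          # records used in the analysis
bad_profit  = 1475          # most common category (row total from the table)

baseline_accuracy = bad_profit / total_cases     # always predict "bad risk-profit"
model_accuracy    = 0.624                        # overall accuracy reported in the table

print(round(baseline_accuracy, 3), model_accuracy)   # 0.601 vs. 0.624
```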

The classification table allows us to evaluate a model from the perspective of predictive accuracy. Whether this model would be adequate depends in part on the value of correct predictions and the cost of errors. Given the modest improvement of this model over simply classifying all cases as “bad risk–profit,” in practice an analyst would see if the model could be improved by adding additional predictors and perhaps some interaction terms. Finally, it is important to note that the predictions were evaluated on the same data used to fit the model and for this reason may be optimistic. A better procedure is to keep a separate validation sample on which to evaluate the predictive accuracy of the model.

Making Predictions
We now have the estimated model coefficients. How does the Logistic generated model node make predictions from the model? First, let’s see the actual predictions by adding the generated model to the stream.

Close the Model browsing window
Add the Logistic generated model to the stream
Connect the generated model to the Type node
Add a Table node to the stream and connect the Logistic generated model to the Table node
Execute the Table node

Figure 5.14 Predicted Value and Probability from Logistic Model

The field $L-RISK contains the most likely prediction from the model (here “good risk”). The probabilities for all three outcomes must sum to 1; the model prediction is the outcome category with the highest probability. That probability is contained in the field $LP-RISK. So for the first case, the prediction is “good risk” and the predicted probability of this occurring is .692 for this combination of input values. You prefer that the probability be as close to 1 as possible (the lowest possible value for the predicted category is .333; Why?).

To illustrate how the actual calculation is done, let’s take an individual who is single, has a mortgage, no children, and has an income of 35,000 British pounds (INCOME1K = 35.00). What is the predicted probability of her (although gender was not included in the model) being in each of the three risk categories? Into which risk category would the model place her? Earlier in this chapter we showed the following (where π(j) is the probability of being in outcome category j):

π(j) = g(j) / Σ g(i)  (sum over i = 1 to J), where J is the number of output categories

If we substitute the parameter estimates in order to obtain the estimated probability ratios, we have:

ĝ(1) = e^(.438 - .056*Income1k + .818*Numkids - 1.260*Marital1 + 1.062*Marital2 + .603*Mortgage1)

ĝ(2) = e^(4.285 - .130*Income1k + 1.153*Numkids - 1.220*Marital1 + 1.021*Marital2 + .927*Mortgage1)

and

ĝ(3) = 1

where, because of the coding scheme for the symbolic inputs (Factors):
Marital1 = 1 if Marital=divsepwid; 0 otherwise
Marital2 = 1 if Marital=married; 0 otherwise
Mortgage1 = 1 if Mortgage=n; 0 otherwise

Thus for our hypothetical individual, the estimated probability ratios are:

ĝ(1) = e^(.438 - .056*35.0 + .818*0 - 1.260*0 + 1.062*0 + .603*0) = e^(-1.522) = .218

ĝ(2) = e^(4.285 - .130*35.0 + 1.153*0 - 1.220*0 + 1.021*0 + .927*0) = e^(-.265) = .767

ĝ(3) = 1

And the estimated probabilities are:

π̂(1) = .218 / (.218 + .767 + 1) = .110

π̂(2) = .767 / (.218 + .767 + 1) = .386

π̂(3) = 1 / (.218 + .767 + 1) = .504

Since the third group (good risk) has the greatest expected probability (.504), the model predicts that the individual belongs to that group. The next most likely group to which the individual would be assigned would be group 2 (bad risk–profit) because its expected probability is the next largest (.386).
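The arithmetic above is easy to verify. A short Python check using the estimated coefficients (any tiny discrepancies come from the rounding of the published coefficients):

```python
import math

income1k, numkids = 35.0, 0
marital1, marital2, mortgage1 = 0, 0, 0      # single, with a mortgage, per the example

g1 = math.exp(0.438 - 0.056 * income1k + 0.818 * numkids
              - 1.260 * marital1 + 1.062 * marital2 + 0.603 * mortgage1)
g2 = math.exp(4.285 - 0.130 * income1k + 1.153 * numkids
              - 1.220 * marital1 + 1.021 * marital2 + 0.927 * mortgage1)
g3 = 1.0

total = g1 + g2 + g3
print(round(g1, 3), round(g2, 3))                      # about 0.218 and 0.767
print([round(g / total, 3) for g in (g1, g2, g3)])     # about 0.110, 0.386, 0.504
```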

Additional Readings
Those interested in learning more about logistic regression might consider David W. Hosmer and Stanley Lemeshow’s Applied Logistic Regression, 2nd Edition, New York: Wiley, 2000.


Summary Exercises

A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.

Exercise Data Information Sheets
The exercises in this chapter are written around the data file RiskTrain.txt. The following section gives details of the file.

RiskTrain.txt contains information from a risk assessment study in which customers with credit cards were assigned to one of three categories: good risk, bad risk-profitable (some payments missed or other problems, but profitable for the issuing company), and bad risk-loss. In addition to the risk classification field, a number of demographics are available for about 2,500 cases. Interest is in predicting credit risk from the demographic fields. The file contains the following fields:

ID        ID number
AGE       Age
INCOME    Income in British pounds
GENDER    Gender
MARITAL   Marital status
NUMKIDS   Number of dependent children
NUMCARDS  Number of credit cards
HOWPAID   How often is customer paid by employer (weekly, monthly)
MORTGAGE  Does customer have a mortgage?
STORECAR  Number of store credit cards
LOANS     Number of outstanding loans
RISK      Credit risk category
INCOME1K  Income in thousands of British pounds (field derived within Clementine)

1. Continuing with the stream from the chapter, add the other available inputs, excluding income (which is linearly related to income1k), and ID, to a logistic regression model and evaluate the results. Do the additional variables substantially improve the predictive accuracy of the model? Examine the estimated coefficients for the significant inputs. Do these relationships make sense?

2. Rerun the Logistic node, dropping those inputs that were not significant in the last analysis. Does the accuracy of the model change much? Does the interpretation of any of the coefficients change substantially?

3. Rerun the Logistic node, this time using the Stepwise Method. Do the input fields selected match those retained in Exercise 2?


4. Run a rule induction model (using C5.0 or CHAID) on this data, using all fields but ID and income as inputs. How does the accuracy of this model compare to that found by logistic regression? What does this suggest about the relations in the data? Do the inputs used by the model correspond to the inputs that were found to be significant in the logistic regression analysis?

5. Run a neural net model on this data, again excluding ID and income as inputs. Include a sensitivity analysis. Does the neural network outperform the other models? Are the important predictors in the neural network model (sensitivity results) the same as the significant input fields in the logistic regression?


Chapter 6: Discriminant Analysis

Objectives
• How Does Discriminant Analysis Work?
• The Elements of a Discriminant Analysis
• The Discriminant Model
• How Cases are Classified
• Assumptions of Discriminant Analysis
• Analysis Tips
• A Two–Group Discriminant Example

Data
To demonstrate discriminant analysis we take data from a study in which respondents answered, hypothetically, whether they would accept an interactive news subscription service (via cable). There is interest in identifying those segments most likely to adopt the service. Several demographic variables are available, including: education, gender, age, income (in categories), number of children, number of organizations the respondent belonged to, and the number of hours of TV watched per day. The outcome measure was whether they would accept the offering.

6.1 Introduction
Discriminant analysis is a technique designed to characterize the relationship between a set of variables, often called the response or predictor variables, and a grouping variable with a relatively small number of categories. By modeling the relationship, discriminant can make predictions for categories of the grouping variable. To do so, discriminant creates a linear combination of the predictors that best characterizes the differences among the groups. The technique is related to both regression and multivariate analysis of variance, and as such it is another general linear model technique. Another way to think of discriminant analysis is as a method to study differences between two or more groups on several variables simultaneously. Common uses of discriminant include:

1. Deciding whether a bank should offer a loan to a new customer.
2. Determining which customers are likely to buy a company’s products.
3. Classifying prospective students into groups based on their likelihood of success at a school.
4. Identifying patients who may be at high risk for problems after surgery.

6.2 How Does Discriminant Analysis Work?
Discriminant analysis assumes that the population of interest is composed of separate and distinct populations, as represented by the grouping variable. The discriminant analysis grouping variable can have two or more categories. Furthermore, we assume that each population is measured on a set of variables—the predictors—that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of the predictors that best separate the populations. If we assume two predictor variables, X and Y, and two groups for simplicity, this situation can be represented as in Figure 6.1.

Figure 6.1 Two Normal Populations and Two Predictor Variables, with Discriminant Axis

The two populations or groups clearly differ in their mean values on both the X and Y axes. However, the linear function—in this instance, a straight line—that best separates the two groups is a combination of the X and Y values, as represented by the line running from lower left to upper right in the scatterplot. This line is a graphic depiction of the discriminant function, or linear combination of X and Y, that is the best predictor of group membership. In this case with two groups and one function, discriminant will find the midpoint between the two groups that is the optimum cutoff for separating the two groups (represented here by the short line segment). The discriminant function and cutoff can then be used to classify new observations.

If there are more than two predictors, then the groups will (hopefully) be well separated in a multidimensional space, but the principle is exactly the same. If there are more than two groups, more than one classification function can be calculated, although not all the functions may be needed to classify the cases. Since the number of predictors is almost always more than two, scatterplots such as Figure 6.1 are not always that helpful. Instead, plots are often created using the new discriminant functions, since it is on these that the groups should be well separated.

The effect of each predictor on each discriminant function can be determined, and the predictors can be identified that are more important or more central to each function. Nevertheless, unlike in regression, the exact effects of the predictors are not typically seen as of ultimate importance in discriminant analysis. Given the primary goal of correct prediction, the specifics of how this is accomplished are not as critical as the prediction itself (such as offering loans to customers who will pay them back). Second, as will be demonstrated below, the predictors do not directly predict the grouping variable, but instead a value on the discriminant function, which, in turn, is used to generate a group classification.

6.3 The Discriminant Model
The discriminant model has the following mathematical form for each function:

FK = D0 + D1X1 + D2X2 + ... + DpXp

where FK is the score on function K, the Di’s are the discriminant coefficients, and the Xi’s are the predictor or response variables (there are p predictors). The maximum number of functions K that can be derived is equal to the minimum of the number of predictors (p) or the quantity (number of groups – 1). In most applications, there will be more predictors than categories of the grouping variable, so the latter will limit the number of functions. For example, if we are trying to predict which customers will choose one of three offers, then (3 – 1), or two, classification functions can be derived.

When more than one function is derived, each subsequent function is chosen to be uncorrelated, or orthogonal, to the previous functions (just as in principal components analysis, where each component is uncorrelated with all others, see Chapter 7). This allows for straightforward partitioning of variance. Discriminant creates a linear combination of the predictor variables to calculate a discriminant score for each function. This score is used, in turn, to classify cases into one of the categories of the grouping variable.
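A discriminant score is just this linear combination evaluated for one case. A minimal sketch with made-up coefficients (the real D values come from the fitted model):

```python
def discriminant_score(x, d0, d):
    """F = D0 + D1*X1 + ... + Dp*Xp for a single discriminant function."""
    return d0 + sum(di * xi for di, xi in zip(d, x))

# Hypothetical two-predictor example
score = discriminant_score([2.0, 5.0], d0=-1.0, d=[0.8, 0.3])
print(score)   # about 2.1 -- compared against the function's cutoff to assign a group
```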

6.4 How Cases Are Classified
There are three general types of methods to classify cases into groups.

1. Maximum likelihood or probability methods: These techniques assign a case to group k if its probability of membership is greater for group k than for any other group. These probabilities are posterior probabilities, as defined below. This method relies upon assumptions of multivariate normality to calculate probability values.

2. Linear classification functions: These techniques assign a case to group k if its score on the function for that group is greater than its score on the function for any other group. This method was first suggested by Fisher, so these functions are often called Fisher linear discriminant functions (which is how Clementine refers to them).

3. Distance functions: These techniques assign a case to group k if its distance to that group's centroid is smaller than its distance to any other group's centroid. Typically, the Mahalanobis distance is the measure of distance used in classification.

When the assumption of equal covariance matrices is met, all three methods give equivalent results. Clementine uses the first technique, a probability method based on Bayesian statistics, to derive a rule to classify cases. The rule uses two probability estimates. The prior probability is an estimate of the probability that a case belongs to a particular group when no information from the predictors is available. Prior probabilities are typically either determined by the number of cases in each category of the grouping variable, or by assuming that the prior probabilities are all equal (so that if there are three groups, the prior probability of each group would be 1/3). We have more to say about prior probabilities below. Second, the conditional probability is the probability of obtaining a specific discriminant score (or one further from the group mean) given that a case belongs to a specific group. By assuming that the discriminant scores are normally distributed, it is possible to calculate this probability. With this information and by applying Bayes' rule, the posterior probability is calculated, which is defined as the likelihood or probability of group membership, given a specific discriminant score. It is this probability value that is used to classify a case into a group. That is, a case is assigned to the group with the highest posterior probability. Although Clementine uses a probability method of classification, you will most likely use a method based on a linear function to classify new data. This is mainly for ease of calculation, because calculating probabilities for new data is computationally intensive compared to using a classification function. This will be illustrated below.
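The sketch below illustrates the posterior-probability idea on a two-group problem. It is a simplified stand-in for what Clementine does internally, not its actual algorithm: the group centroids, the priors, and the use of a normal density for the conditional part are all assumptions made for the example.

import numpy as np
from scipy.stats import norm

# Assumed group centroids on the discriminant function, and prior probabilities
centroids = {"no": -0.40, "yes": 0.42}
priors    = {"no":  0.51, "yes": 0.49}

def posterior(score, sd=1.0):
    """Posterior probability of each group for one discriminant score."""
    # conditional part: normal density of the score within each group
    dens = {g: norm.pdf(score, loc=m, scale=sd) for g, m in centroids.items()}
    # Bayes' rule: prior times conditional, normalized over the groups
    num = {g: priors[g] * dens[g] for g in dens}
    total = sum(num.values())
    return {g: v / total for g, v in num.items()}

post = posterior(0.75)
print(post)                      # approximately {'no': 0.36, 'yes': 0.64}
print(max(post, key=post.get))   # assign the case to the highest posterior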

6.5 Assumptions of Discriminant Analysis

As with other general linear model techniques, discriminant makes some fairly rigorous assumptions about the population. And as with these other techniques, it tends to be fairly robust to violations of these assumptions. Discriminant assumes that the predictor variables are measured on an interval or ratio scale. However, as with regression, discriminant is often used successfully with variables that are ordinal, such as questionnaire responses on a five- or seven-point Likert scale. Nominal variables can be used as predictors if they are given dummy coding. The grouping variable can be measured on any scale and can have any number of categories, though in practice most analyses are run with five or fewer categories.

Discriminant assumes that each group is drawn from a multivariate normal population. This assumption is often violated, and moderate departures from normality are usually not a problem, especially as sample size increases. If this assumption is violated, the tests of significance and the probabilities of group membership will be inexact. If the groups are widely separated in the space of the predictor variables, this will not be as critical as when there is a fair amount of overlap between the groups. When the number of categorical predictor variables is large (as opposed to interval–ratio predictors), multivariate normality cannot hold by definition. In that case, greater caution must be used, and many analysts would choose to use logistic regression instead. Most evidence indicates that discriminant often performs reasonably well with such predictors, though.

Another important assumption is that the covariance matrices of the various groups are equal. This is equivalent to the standard assumption in analysis of variance about equal variances across factor levels. When this is violated, distortions can occur in the discriminant functions and the classification equations. For example, the discriminant functions may not provide maximum separation between groups when the covariances are unequal. If the covariance matrices are unequal but the variables' distribution is multivariate normal, the optimum classification rule is the quadratic discriminant function. But if the matrices are not too dissimilar, the linear discriminant function performs quite well, especially when the sample sizes are small. This assumption can be tested with the Explore procedure or with the Box's M statistic, displayed by Discriminant. For a more detailed discussion of problems with assumption violation, see P. A. Lachenbruch (Discriminant Analysis, 1975, New York: Hafner) or Carl Huberty (Applied Discriminant Analysis, 1994, New York: Wiley).

6.6 Analysis Tips

In addition to the assumptions of discriminant, some additional guidelines are helpful. Many analysts recommend having at least 10 to 20 times as many cases as predictor variables to ensure that a model doesn't capitalize on chance variation in a particular sample. For accurate classification, another common rule is that the number of cases in the smallest group should be at least five times the number of predictors. In the interests of parsimony, Huberty recommends a goal of only 8 to 10 predictor variables in the final model. Although in applied work this may be too stringent, keep in mind that more is not always better.

Outlying cases can affect the results by biasing the values of the discriminant function coefficients. Looking at the Mahalanobis distance for a case or examining the probabilities is normally an effective check for outliers. If a case has a relatively high probability of being in more than one group, it is difficult to classify. Analyses can be run with and without outliers to see how results are affected.

Multicollinearity is less of a problem in discriminant analysis because the exact effect of a predictor variable is typically not the focus of an analysis. When two variables are highly correlated, it is difficult to partition the variance between them, and the coefficient estimates can be unstable. Still, the accuracy of prediction may be little affected. Multicollinearity can be more of a problem when stepwise methods of variable selection are used, since variables can be removed from a model for reasons unrelated to that variable's ability to separate the groups.

6.7 Comparison of Discriminant and Logistic Regression

Discriminant and logistic regression have the same broad purpose: to build a model predicting which category (or group) individuals belong to based on a set of interval scale predictors. Discriminant formally makes stronger assumptions about the predictor variables, specifically that for each group they follow a multivariate normal distribution with identical population covariance matrices. Based on this you would expect discriminant to be rarely used, since this assumption is seldom met in practice. However, Monte Carlo simulation studies indicate that multivariate normality is not critical for discriminant to be effective.

Discriminant follows from a view that the domain of interest is composed of separate populations, each of which is measured on variables that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of these measures that best separate the populations. This is represented in Figure 6.1. The two populations are best separated along an axis (discriminant function) that is a linear combination of x and y. The midpoint between the two populations is the cut-point. This function and cut-point would be used to classify future cases.


Logistic regression, as we have seen in Chapter 5, is derived from a view of the world in which individuals fall more along a continuum. This difference in formulation led discriminant to be employed in credit analysis (there are those who repay loans and those who don't), while logistic regression was used to make risk adjustments in medicine (depending on demographics, health characteristics and treatment, you are more or less likely to survive a disease). Despite these different origins, discriminant and logistic regression give very similar results in practice. Monte Carlo simulation work has not found one to be superior to the other across a broad range of circumstances. There is, of course, the obvious point that if the data are based on samples from multivariate normal populations, then discriminant outperforms logistic regression.

One consideration when choosing between these two methods involves how many dichotomous (or dummy-coded) predictor variables are used in the analysis. Because of the stronger assumptions made about the predictor variables by discriminant, the more categorical variables you have, the more you would lean toward logistic regression. Within the domain of response-based segmentation, discriminant analysis is seen more often when the problem is framed from the business side, while logistic models are more common when the problem is formulated from a marketing perspective as a choice model. Note that neither discriminant nor logistic regression will produce a list of segments. Rather, they will indicate which predictor variables (some may represent demographic characteristics) are relevant to the outcome. From the prediction equation or other summary measures you can determine the combinations of characteristics that most likely lead to the desired outcome.

Recommendations

Logistic regression and discriminant analysis give very similar results in practice. Since discriminant does make stronger assumptions about the nature of your predictor variables (formally, multivariate normality and equal covariance matrices are assumed), as more of your predictor variables are categorical (and thus need to be dummy coded) or dichotomous, you would move in the direction of logistic regression. Certain research areas have a tradition of using only one of the methods, which may also influence your choice.

6.8 An Example: Discriminant

To demonstrate discriminant analysis we take data from a study in which respondents answered, hypothetically, whether they would accept an interactive news subscription service (via cable). Most of the predictor variables are interval scale, the exceptions being GENDER (a dichotomy) and INC (an ordered categorical variable). We would expect few if any of these variables to follow a normal distribution, but we will proceed with discriminant.

Note that the predictor fields for discriminant must be numeric. Most importantly, if you have predictors that are truly categorical, such as region of the US (e.g., northwest, southwest, etc.), Discriminant will not create dummy variables for these categories, even if they are numerically coded. You will need to create the dummy variables yourself (use the SetToFlag node) and then enter them in the model, leaving one out so as not to create redundancy. In this example we won't face this issue.


As in our other examples, we will move directly to the analysis, although ordinarily you would run data checks and exploratory data analysis first.

Click File…Open Stream and then move to the c:\Train\ClemPredModel folder
Double-click on Discriminant.str
Right-click on the Table node and select Execute to view the data

Figure 6.2 The Interactive News Study Data

Place a Discriminant node from the Modeling palette to the right of the Type node
Connect the Type node to the Discriminant node

The name of the Discriminant node will immediately change to NEWSCHAN, the outcome field.

Figure 6.3 Discriminant Node Added to the Stream


Double-click on the Discriminant node
Click on the Model tab

Figure 6.4 Discriminant Dialog

The Use partitioned data option can be used to split the data into separate samples for training and testing. This may provide an indication of how well the model will work with new data. We will not use this option in this example, but instead will take advantage of a different option for validating the model (Leave-one-out classification) that is built into the Discriminant procedure.

The Method option allows you to specify how you want the predictors entered into the model. By default, all of the terms are entered into the equation. If you do not have a particular model in mind, you can invoke the Stepwise option, which will enter predictor variables into the equation based on a statistical criterion. At each step, terms that have not yet been added to the model are evaluated, and if the best of those terms adds significantly to the predictive power of the model, it is added. Some analysts prefer to enter all the predictor variables into the equation and then evaluate which are important. However, if there are many correlated predictor variables, you run the risk of multicollinearity, in which case a Stepwise method may be preferred. A drawback is that the Stepwise method has a strong tendency to overfit the training data. When using this method, it is especially important to verify the validity of the resulting model with a hold-out test sample or new data (which is common practice in data mining).

Click on the Method button and select Stepwise

Figure 6.5 Discriminant Analysis with Method Stepwise

Click on the Expert tab
Click on Expert mode


Figure 6.6 Discriminant Expert Options

You can use the Prior Probabilities area to provide Discriminant with information about the distribution of the outcome in the population. By default, before examining the data, Discriminant assumes an observation is equally likely to belong to each group. If you know that the sample proportions reflect the distribution of the outcome in the population, then you can use the Compute from group sizes option to instruct Discriminant to make use of this information. For example, if an outcome category is very rare, Discriminant can make use of this fact in its prediction equation. We don't know what the proportions would be, so we retain the default.

The Use covariance matrix option is often useful whenever the homogeneity of variance assumption is not met. In general, if the groups are well separated in the discriminant space, heterogeneity of variance will not be terribly important. However, in situations when you do violate the equal variance assumption, it may be useful to use the Separate-groups covariance matrices option to see if your predictions change by very much. If they do, that would suggest that the violation of the equal variance assumption was serious. It should be noted that using separate-groups covariance matrices does not affect the results prior to classification. This is because Clementine does not use the original scores to do the classification. Thus, the use of the Fisher classification functions is not equivalent to classification by Clementine with separate covariance matrices.
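Clementine's prior-probability options are set in this dialog rather than in code, but the same idea can be illustrated with scikit-learn's LinearDiscriminantAnalysis, which accepts priors directly. Everything below (the toy data and the roughly 20% positive rate) is invented for the illustration; it is not the news-channel data.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # toy predictors
y = (rng.random(200) < 0.2).astype(int)    # imbalanced outcome (about 20% ones)
X[y == 1] += 0.8                           # shift one group so it is separable

# Equal priors (the default assumption described above)
lda_equal = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)

# Priors computed from the observed group sizes
lda_sizes = LinearDiscriminantAnalysis(priors=np.bincount(y) / len(y)).fit(X, y)

print(lda_equal.predict_proba(X[:3]))
print(lda_sizes.predict_proba(X[:3]))      # the rare group gets lower posteriors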

Click the Output button


Figure 6.7 Discriminant Advanced Output Dialog

Checking Univariate ANOVAs will have Clementine display significance tests of between-group (outcome category) differences on each of the predictor variables. The point of this is to provide some hint as to which variables will prove useful in the discriminant function, although this is precisely what discriminant will resolve. The Box's M statistic is a direct test of the equality of covariance matrices. The covariance matrices are ancillary output and very rarely viewed in practice. However, you might view the within-groups correlations among the predictor variables to identify highly correlated predictors.

Either Fisher's coefficients or the unstandardized discriminant coefficients can be used to make predictions for future observations (customers). Both sets of coefficients produce the same predictions when equal covariance matrices are assumed. If there are only two outcome categories (as is our situation), either set of coefficients is easy to use. If you want to try "what if" scenarios using a spreadsheet, the unstandardized coefficients, which involve a single equation in the two-outcome case, are more convenient. If you run discriminant with more than two outcome categories, then Fisher's coefficients are easier to apply as prediction rules.

Casewise results can be used to display the codes for the actual group, predicted group, posterior probabilities, and discriminant scores for each case. The Summary table, also known by several other names, including Classification table, Misclassification table, and Confusion table, displays the number of cases correctly and incorrectly assigned to each of the groups based on the discriminant analysis. The Leave-one-out classification classifies each case based on discriminant coefficients calculated while that case is excluded from the analysis. This is a jackknife method and provides a classification table that should generalize at least slightly better to other samples.


You can also produce a Territorial map, which is a plot of the boundaries used to classify cases into groups, but the map will not be displayed if there is only one discriminant function (the maximum number of functions is equal to the number of categories – 1 in the outcome field). The Stepwise options allow you to display a Summary of statistics for all variables after each step.

Click the Means, Univariate ANOVAs, and Box's M check boxes in the Descriptives area

Click the Fisher's and Unstandardized check boxes in the Function Coefficients area
Click the Summary table and Leave-one-out classification check boxes in the Classification area

Figure 6.8 Discriminant Advanced Output Dialog After Option Selection

Click OK
Click the Stepping button


Figure 6.9 Stepping Dialog

Wilks’ lambda is the default and probably the most common method. The differences between the methods are somewhat technical and beyond the scope of this course. You can change the statistical criterion for variable entry. For example, you might want to make the criterion more stringent when working with a large sample.

Click Cancel
Click the Execute button
Browse the Discriminant generated model in the Models Manager window
Click the Advanced tab, and then expand the browsing window
Scroll to the Classification Results

Figure 6.10 Classification Results Table

Although this table appears at the end of the discriminant output, we turn to it first. It is an important summary since it tells us how well we can expect to predict the outcome. The actual (known) groups constitute the rows and the predicted groups make up the columns. Of the 227 people surveyed who said they would not accept the offering, the discriminant model correctly predicted 157 of them; thus accuracy for this group is 69.2%. For the 214 respondents who said they would accept the offering, 66.4% were correctly predicted. Overall, the discriminant model was accurate in 67.8% of the cases. Is this good? Will this model work well with new data?

The answer to the first question will largely depend on what level of predictive accuracy you required before you began the project. One way we can assess the success of the model is to compare these results with the predictions we would have made if we simply guessed the larger group. If we did that, we would be correct in 227 of 441 (227 + 214) instances, or about 51.5% of the time. The 67.8% correct figure, while certainly far from perfect accuracy, does far better than guessing. The Cross-validated portion of the table gives us an idea about how accurate this model will be with new data. The percent of correctly classified cases has decreased slightly from 67.8% to 67.3% for the cross-validation. Because these results are virtually identical, it appears the model is valid. Since we are interested in discovering which characteristics are associated with someone who accepts the news channel offer, we proceed.
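The percentages above follow directly from the counts in the classification table; the few lines below simply reproduce that arithmetic so you can see where the 67.8% and 51.5% figures come from.

# Counts taken from the classification table
n_no, n_yes = 227, 214
hit_no  = 157                      # correctly predicted "no"
hit_yes = round(0.664 * n_yes)     # 66.4% of the "yes" group, about 142

total    = n_no + n_yes                   # 441
overall  = (hit_no + hit_yes) / total     # about 0.678
baseline = max(n_no, n_yes) / total       # always guess the larger group: about 0.515
print(round(overall, 3), round(baseline, 3))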

Scroll back to the Group Statistics pivot table

Figure 6.11 Group Statistics

Viewing the means by themselves is of limited use, but notice the group that would accept the service is about 7 years older than the group that would not accept, whereas the daily hours of TV viewing are almost identical for the two groups. The standard deviations are very similar across groups, which is promising for the equal covariance matrices assumption.

Scroll to the Tests of Equality of Group Means pivot table


Figure 6.12 Univariate F Tests

The significance tests of between-group differences on each of the predictor variables provide hints as to which will be useful in the discriminant function (recall we are using Wilks' criterion as a stepwise method). Notice Age in Years has the largest F (is most significant) and will be selected first in the stepwise solution. This table looks at each variable ignoring the others, while discriminant adjusts for the presence of the other variables in the equation (as would regression).

Scroll to the Box's M test results

Figure 6.13 Box's M Test Results

Because the significance value is well above 0.05, we cannot reject the null hypothesis that the covariance matrices are equal. However, the Box's M test is quite powerful and leads to rejection of equal covariances when the ratio N/p is large, where N is the number of cases and p is the number of variables. The test is also sensitive to lack of multivariate normality, which applies to these data. If the covariance matrices were unequal, the effect on the analysis would be to create errors in the assignment of cases to groups.

Scroll to the Eigenvalues and Wilks’ Lambda portion of the output


Figure 6.14 Summaries of Discriminant Function (Eigenvalues and Wilks' Lambda)

These two tables are overall summaries of the discriminant function. The canonical correlation measures the correlation between a variable (or variables when there are more than two groups) contrasting the groups and an optimal (in terms of maximizing the correlation) linear combination of the predictor variables. In short, it measures the strength of the relationship between the predictor variables and the groups. Here, there is a modest (.363) canonical correlation. Wilks' lambda provides a multivariate test of group differences on the predictor variables. If this test were not significant (it is highly significant), we would have no basis on which to proceed with discriminant analysis. Now we view the individual coefficients.

Scroll down until you see the Standardized Coefficients and Structure Matrix

Figure 6.15 Standardized Coefficients and Structure Matrix

Standardized discriminant coefficients can be used as you would use standardized regression coefficients in that they attempt to quantify the relative importance of each predictor in the discriminant function. The only three predictors selected by the stepwise analysis were Education, Gender and Age. Not surprisingly, age is the dominant factor. The signs of the coefficients can be interpreted with respect to the group means on the discriminant function. Notice the coefficient for gender is negative. Other things being equal, shifting from a man (code 0) to a woman (code 1) is a one-unit change which, when multiplied by the negative coefficient, lowers the discriminant score and moves the individual toward the group with a negative mean (those that don't accept the offering). Thus women are less likely to accept the offering, adjusting for the other predictors.

The Structure Matrix displays the correlations between each variable considered in the analysis and the discriminant function(s). Note that income category correlates more highly with the function than gender or education, but it was not selected in the stepwise analysis; this is probably because income correlated with predictors that entered earlier. The standardized coefficients and the structure matrix provide ways of evaluating the discriminant variables and the function(s) separating the groups.

Scroll down until the Canonical Discriminant Function Coefficients and Functions at Group Centroids tables are visible

Figure 6.16 Unstandardized Coefficients and Group Means (Centroids)

In Figure 6.1 we saw a scatterplot of two separate groups and the axis along which they could best be separated. Unstandardized discriminant coefficients, when multiplied by the values of an observation, project an individual onto this discriminant axis (or function) that separates the groups. If you wish to use the unstandardized coefficients for prediction purposes, you would simply multiply a prospective customer's education, gender and age values by the corresponding unstandardized coefficients and add the constant. Then you compare this value to the cut-point (by default the midpoint) between the two group means (centroids) along the discriminant function (the cut-point appears in Figure 6.1). If the prospective customer's value is greater than the cut-point, you predict the customer will accept; if the score is below the cut-point, you predict the customer will not accept. This prediction rule is easy to implement with two groups, but involves much more complex calculations when more than two groups are involved. It is in a convenient form for "what if" scenarios: for example, if we have a male with 16 years of education, at what age would such an individual be a good prospect? To answer this we determine the age value that moves the discriminant score above the cut-point.
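A sketch of this prediction rule is shown below. The coefficients, constant, and centroids are placeholder values standing in for the ones you would read from Figure 6.16; only the mechanics (score the case, then compare it to the midpoint cut-point) mirror the description above.

# Placeholder unstandardized coefficients and constant (read yours from Figure 6.16)
coef = {"educ": 0.18, "gender": -0.70, "age": 0.06}
constant = -5.00

# Group centroids on the discriminant function; the default cut-point is their midpoint
centroid_no, centroid_yes = -0.40, 0.42
cut_point = (centroid_no + centroid_yes) / 2.0

def predict(educ, gender, age):
    score = constant + coef["educ"]*educ + coef["gender"]*gender + coef["age"]*age
    return "accept" if score > cut_point else "not accept"

# "What if": a male (code 0) with 16 years of education at several ages
for age in (25, 40, 55):
    print(age, predict(16, 0, age))   # with these made-up values, the prediction
                                      # flips to "accept" at around age 40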

Scroll down until you see the Classification Function Coefficients

Figure 6.17 Fisher Classification Coefficients

Fisher function coefficients can be used to classify new observations (customers). If we know a prospective customer's education (say 16 years), gender (Female = 1) and age (30), we multiply these values by the set of Fisher coefficients for the No (no acceptance) group (2.07*16 + 1.98*1 + .32*30 – 20.85), which yields a numeric score. We repeat the process using the coefficients for the Yes group and obtain another score. The customer is then placed in the outcome group for which she has the higher score. Thus the Fisher coefficients are easy to incorporate later into other software (spreadsheets, databases) for predictive purposes.

We did not test the normality assumption of discriminant analysis in this example. In general, normality does not make a great deal of difference, but heterogeneity of the covariance matrices can, especially if the sample group sizes are very different. Here the sample sizes were about the same. As mentioned earlier, whether you consider the hit rate here to be adequate really depends on the costs of errors, the benefits of a correct prediction and what your alternatives are. Here, although the prediction was far from perfect, we were able to identify the relations between the demographic variables and the choice outcome.
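The same calculation is easy to script. The No-group coefficients below are the ones quoted above; the Yes-group coefficients are hypothetical placeholders for the values you would read from Figure 6.17.

# Fisher classification function coefficients (the No set is quoted in the text;
# the Yes set is a placeholder, not the actual output)
fisher = {
    "no":  {"educ": 2.07, "gender": 1.98, "age": 0.32, "const": -20.85},
    "yes": {"educ": 2.15, "gender": 1.60, "age": 0.41, "const": -24.10},
}

def classify(educ, gender, age):
    scores = {g: c["const"] + c["educ"]*educ + c["gender"]*gender + c["age"]*age
              for g, c in fisher.items()}
    return max(scores, key=scores.get), scores

group, scores = classify(educ=16, gender=1, age=30)
print(scores)
print("predicted group:", group)   # the group with the higher score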


Summary Exercises

A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.

Exercise Data Information Sheets
The exercises in this chapter are written around the data file Credit.sav. The following section gives details of the file. Credit.sav has the same fields as risktrain.txt except that they are all numeric, so that we can use them all in a discriminant analysis. The file contains the following fields:

ID          ID number
AGE         Age
INCOME      Income
GENDER      Gender
MARITAL     Marital status
NUMKIDS     # of dependent children
NUMCARDS    # of credit cards
HOWPAID     How paid (monthly/weekly)
MORTGAGE    Mortgage
STORECAR    # of store cards held
LOANS       # of other loans
RISK        Credit risk category
INCOME1K    Income in thousands of British pounds (field derived within Clementine)

1. Begin with a clear Stream canvas. Place an SPSS File source node on the canvas and connect it to Credit.sav.

2. Attach a Type node to the Source node, and a Table node to the Type node. Execute the stream and allow Clementine to automatically type the fields.

3. Attach a SetToFlag node to the Type node and create separate dummy fields for each category of the marital field. Make sure that you code the True value as 1 and the False value as 0. This is important because Discriminant expects numeric data for the inputs.

4. Attach a Type node to the SetToFlag node.

5. Edit the second Type node and change the direction for risk to Out, and to None for ID, marital, income1k, and marital_3, or a reference field of your choice. Leave the direction as In for all the rest of the fields.

6. Attach a Discriminant node to the second Type node and run the analysis. How many classification functions are significant? What variables are important predictors?


Chapter 7: Data Reduction: Principal Components

Objectives
• Review principal components analysis, a technique used to perform data reduction prior to modeling
• Run a principal components analysis on a dataset of waste production

Data
We use a file containing information about the amount of solid waste in thousands of tons (WASTE) in various locations, along with information about land use, including the number of acres used for industrial work (INDUST), fabricated metals (METALS), trucking and wholesale trade (TRUCKS), retail trade (RETAIL), and restaurants and hotels (RESTRNTS). The data set appears in Chatterjee and Hadi (1988, Sensitivity Analysis in Linear Regression, New York: Wiley).

7.1 Introduction

Although it is used as an analysis technique in its own right, in this chapter we discuss principal components primarily as a data reduction technique in support of statistical predictive modeling (for example, regression or logistic regression) and clustering. We first review the role of principal components and factor analysis in segmentation and prediction studies, and then discuss what to look for when running these techniques. Some background principles will be covered along with comments about popular factor methods. We provide some overall recommendations. We will perform a principal components analysis on a set of fields recording different types of land usage, all of which are to be used to predict the amount of waste produced from that land.

7.2 Use of Principal Components for Prediction Modeling and Cluster Analyses

In the areas of segmentation and prediction, principal components and factor analysis typically serve in the ancillary role of reducing the many fields available to a core set of composite fields (components or factors) that are used by cluster, regression or logistic regression. Statistical prediction models such as regression, logistic regression, and discriminant analysis, when run with highly correlated input fields, can produce unstable coefficient estimates (the problem of near multicollinearity). In these models, if any input field can be almost or perfectly predicted from a linear combination of the other inputs (near or pure multicollinearity), the estimation will either fail or be badly in error. Prior data reduction using factor or principal components analysis is one approach to reducing this risk. Although we have described this problem in the context of statistical prediction models, neural network coefficients can become unstable under these circumstances. However, since the interpretation of neural network coefficients is relatively rarely done, this issue is less prominent.

Rule induction methods will run when predictors are highly related. However, if two numeric predictors are highly correlated and have about the same relationship to the output, then the predictor with the slightly stronger relationship to the output will enter into the model. The other predictor is unlikely to enter into the model, since it contributes little in addition to the first predictor. While this may be adequate from the perspective of accurate prediction, the fact that the first field entered the model while the second didn't could be taken to mean that the first was important and the second was not. However, if the first were removed, the second predictor would have performed nearly as well. Such relationships among inputs should be revealed as part of the data understanding and data preparation step of a data mining project. If this were not done, or if it were done inadequately, then the data reduction performed by principal components or factor analysis might be necessary (for statistical methods) and helpful (for both statistical and machine learning methods).

In some surveys done for segmentation purposes, dozens of customer attitude measures or product attribute ratings may be collected. Although cluster analysis can be run using a large number of cluster fields, two complications can develop. First, if several fields measure the same or very similar characteristics and are included in a cluster analysis, then what they measure is weighted more heavily in the analysis. For example, suppose a set of rating questions about technical support for a product is used in a cluster analysis with other unrelated questions. Since distance calculations used in the Clementine clustering algorithms are based on the differences between observations on each field, other things being equal, the set of related items would carry more weight in the analysis. To exaggerate to make a point, if two fields were identical copies of each other and both were used in a cluster analysis, the effect would be to double the influence of what they measure. In practice you rarely ask the same number of rating questions about each attribute (or psychographic) area. So principal components and factor analysis are used to explicitly combine the original input fields into independent composite fields, to guide the analyst in constructing subscales, or to aid in the selection of representative sets of fields (some analysts select three fields strongly related to each factor or component to be used in cluster analysis). Clustering is then performed on these fields.

A second reason factor or principal components might be run prior to clustering is for conceptual clarity and simplification. If a cluster analysis were based on forty fields it would be difficult to look at so large a table of means or a line chart and make much sense of them. As an alternative, you can perform rule induction to identify the more influential fields and summarize those. If factor or principal components analysis is run first, then the clustering is based on the themes or concepts measured by the factors or components. Or, as mentioned above, clustering can be done on equal-sized sets of fields, where each set is based on a factor.
If the factors (components) have a ready interpretation, it can be much easier to understand a solution based on five or six factors, compared to one based on forty fields. As you might expect, factor and principal components analyses are more often performed on “soft” measures—attitudes, beliefs, and attribute ratings— and less often on behavioral measures like usage and purchasing patterns. Keep in mind that factor and principal components analysis are considered exploratory data techniques (although there are confirmatory factor methods; for example, Amos or LISREL can be used to test specific factor models). So as with cluster analysis, do not expect a definitive, unassailable answer. When deciding on the number and interpretation of factors or components, domain knowledge of the data, common sense, and a dose of hard thinking are very valuable.


7.3 What to Look for When Running Principal Components or Factor Analysis

There are two main questions that arise when running principal components and factor analysis: how many (if any) components are there, and what do they represent? Most of our effort will be directed toward answering them. These questions are related because, in practice, you rarely retain factors or components that you cannot identify and name. Although the naming of components has rarely stumped a creative researcher for long (which has led to some very odd-sounding "components"), it is accurate enough to say that interpretability is one of the criteria when deciding to keep or drop a component. When choosing the number of components, there are some technical aids (eigenvalues, percentage of variance accounted for) we will discuss, but they are guides and not absolute criteria. To interpret the components, a set of coefficients, called loadings or lambda coefficients, relating the components (or factors) to the original fields, is very important. They provide information as to which components are highly related to which fields and thus give insight into what the components represent.

7.4 Principles

Factor analysis operates (and principal components usually operates) on the correlation matrix relating the numeric fields to be analyzed. The basic argument is that the fields are correlated because they share one or more common components; if they didn't correlate, there would be no need to perform factor or component analysis. Mathematically, a one-factor (or component) model for three fields can be represented as follows, where the Vs are fields (or variables), F is a factor (or component), and the Es represent error variation that is unique to each field (uncorrelated with the F component and with the E components of the other variables):

V1 = L1*F1 + E1 V2 = L2*F1 + E2 V3 = L3*F1 + E3

Each field is composed of the common factor (F1) multiplied by a loading coefficient (L1, L2, L3 - the lambdas) plus a unique or random component. If the factor were measurable directly (which it isn’t) this would be a simple regression equation. Since these equations can’t be solved as given (the Ls, Fs and Es are unknown), factor and principal components analysis take an indirect approach. If the equations above hold, then consider why fields V1 and V2 correlate. Each contains a random or unique component that cannot contribute to their correlation (Es are assumed to have 0 correlation). However, they share the factor F1, and so if they correlate the correlation should be related to L1 and L2 (the factor loadings). When this logic is applied to all the pairwise correlations, the loading coefficients can be estimated from the correlation data. One factor may account for the correlations between the fields, and if not, the equations can be easily generalized to accommodate additional factors. There are a number of approaches to fitting factors to a correlation matrix (least squares, generalized least squares, maximum likelihood), which has given rise to a number of factor methods. What is a factor? In market research factors are usually taken to be underlying traits, attitudes or beliefs that are reflected in specific rating questions. You need not believe that factors or components actually exist in order to perform a factor analysis, but in practice the factors are usually interpreted, given names, and generally spoken of as real things.
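The logic that the loadings reproduce the correlations can be demonstrated with a small simulation. The loadings below are arbitrary choices made for the example; the point is only that, when the one-factor model holds, the correlation between any two fields is approximately the product of their loadings.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
L = np.array([0.8, 0.6, 0.5])                              # chosen loadings L1, L2, L3

F = rng.normal(size=n)                                     # the common factor
E = rng.normal(size=(3, n)) * np.sqrt(1 - L[:, None]**2)   # unique components
V = L[:, None] * F + E                                     # V_i = L_i * F + E_i

print(np.corrcoef(V).round(2))        # off-diagonal entries are close to L_i * L_j
print(np.outer(L, L).round(2))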


7.5 Factor Analysis versus Principal Components Analysis

Within the general area of data reduction there are two highly related techniques: factor analysis and principal components analysis. They can both be applied to correlation matrices with data reduction as a goal. They differ in a technical way having to do with how they attempt to fit the correlation matrix. We will pursue the distinction since it is relevant to which method you choose. The diagram below is a correlation matrix composed of five numeric fields.

Figure 7.1 Correlation Matrix of Five Numeric Fields

Principal components analysis attempts to account for the maximum amount of variation in the set of fields. Since the diagonal of a correlation matrix (the ones) represents standardized variances, each principal component can be thought of as accounting for as much as possible of the variation remaining in the diagonal. Factor analysis, on the other hand, attempts to account for correlations between the fields, and therefore its focus is more on the off-diagonal elements (the correlations). So while both methods attempt to fit a correlation matrix with fewer components or factors than fields, they differ in what they focus on when fitting. Of course, if a principal component accounts for most of the variance in fields V1 and V2 , it must also account for much of the correlation between them. And if a factor accounts for the correlation between V1 and V2 , it must account for at least some of their (common) variance. Thus, there is definitely overlap in the methods and they usually yield similar results. Often factor is used when there is interest in studying relations among the fields, while principal components is used when there is a greater emphasis on data reduction and less on interpretation. However, principal components is very popular because it can run even when the data are multicollinear (one field can be perfectly predicted from the others), while most factor methods cannot. In data mining, since data files often contain many fields likely to be multicollinear or near multicollinear, principal components is used more often. This is especially the case if statistical modeling methods, which will not run with multicollinear predictors, are used. Both methods are available in the PCA/Factor node; by default, the principal components method is used.

7.6 Number of Components

When factor or principal components analysis is run there are several technical measures that can guide you in choosing a tentative number of factors or components. The first indicator would be the eigenvalues. Eigenvalues are fairly technical measures, but in principal components analysis, and some factor methods (under orthogonal rotations), their values represent the amount of variance in the input fields that is accounted for by the components (or factors). If we turn back to the correlation matrix in Figure 7.1, there are five fields and therefore 5 units of standardized variance to be accounted for. Each eigenvalue measures the amount of this variance accounted for by a factor. This leads to a rule of thumb and a useful measure to evaluate a given number of factors. The rule of thumb is to select as many factors as there are eigenvalues greater than 1. Why? If the eigenvalue represents the amount of standardized variance in the fields accounted for by the factor, then if it is above 1, it must represent variance contained in more than one field. This is because the maximum amount of standardized variance contained in a single field is 1. Thus, if in our five-field analysis the first eigenvalue were 3, it must account for variation in several fields. Now an eigenvalue can be less than 1 and still account for variation shared among several fields (for example, 30% of the variation of each of three fields for an eigenvalue of .9), so the eigenvalue-of-1 rule is only applied as a rule of thumb. Another aspect of eigenvalues (for principal components and some factor methods) is that their sum is the same as the number of fields, which is equal to the total standardized variance in the fields. Thus you can convert the eigenvalue into a measure of percentage of explained variance, which is helpful when evaluating a solution. Finally, it is important to mention that in applications in which you need to be able to interpret the results, the components must make sense. For this reason, factors with eigenvalues over 1 that cannot be interpreted may be dropped and those with eigenvalues less than 1 may be retained.
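The sketch below shows these quantities on a small set of simulated fields (two correlated pairs plus one independent field, an arbitrary setup chosen for illustration): the eigenvalues of the correlation matrix sum to the number of fields, each one can be converted to a percentage of variance, and the eigenvalue-greater-than-1 rule gives a tentative number of components.

import numpy as np

rng = np.random.default_rng(2)
base = rng.normal(size=(500, 2))
X = np.column_stack([base[:, 0] + rng.normal(scale=0.5, size=500),
                     base[:, 0] + rng.normal(scale=0.5, size=500),
                     base[:, 1] + rng.normal(scale=0.7, size=500),
                     base[:, 1] + rng.normal(scale=0.7, size=500),
                     rng.normal(size=500)])          # one unrelated field

R = np.corrcoef(X, rowvar=False)                     # 5 x 5 correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]                # eigenvalues, largest first

print(eigvals.round(2))                              # they sum to 5, the number of fields
print((eigvals / eigvals.sum()).round(2))            # proportion of variance per component
print("retained by the eigenvalue > 1 rule:", int((eigvals > 1).sum()))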

7.7 Rotations

When factor analysis succeeds you obtain a relatively small number of interpretable factors that account for much of the variation in the original set of fields. Suppose you have eight fields and factor analysis returns a two-factor solution. Formally, the factor solution represents a two-dimensional space. Such a space can be represented with a pair of axes as shown below. While each pair of axes defines the same two-dimensional space, the coordinates of a point would vary depending on which pair of axes was applied. This creates a problem for factor methods since the values for the loadings or lambda coefficients vary with the orientation of the axes and there is no unique orientation defined by the factor analysis itself. Principal components does not suffer from this problem since its method produces a unique orientation. This difficulty for factor analysis is a fundamental mathematical problem. The solutions to it are designed to simplify the task of interpretation for the analyst. Most involve, in some fashion, finding a rotation of the axes that maximizes the variance of the loading coefficients, so some are large and some small. This makes it easier for the analyst to interpret the factors. This is the best that can currently be done, but the fact that factor loadings are not uniquely determined by the method is a valid criticism leveled against it by some statisticians. We will discuss the various rotational schemes in the Methods section below.


Figure 7.2 Two-Dimensional Space

7.8 Component Scores

If you are satisfied with a factor analysis or principal components solution, you can request that a new set of fields be created that represent the scores of each data record on the factors. These are calculated by summing the product of each original field and a weight coefficient (derived from the lambda coefficients). These factor score fields can then be used as the inputs for prediction and segmentation analyses. They are usually normalized to have a mean of zero and a standard deviation of one. An alternative some analysts prefer is to use the lambda coefficients to judge which fields are highly related to a factor, and then compute a new field which is the sum or mean of that set of fields. This method, while not optimal in a technical sense, keeps (if means are used) the new scores on the same scale as the original fields (this of course assumes the fields themselves share a common scale), which can make the interpretation and the presentation straightforward. Essentially, subscale scores are created based on the factor results, and these scores are used in further analyses.
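For readers working outside Clementine, the same workflow (standardize the fields, extract a few components, feed the component scores to a regression) can be sketched with scikit-learn. The data here are simulated stand-ins; note also that scikit-learn's component scores are centered at zero but are not rescaled to unit standard deviation, unlike the normalized factor scores described above.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))                        # stand-in input fields
X[:, 3] = X[:, 0] + rng.normal(scale=0.1, size=200)  # one nearly redundant field
y = X[:, 0] + X[:, 4] + rng.normal(size=200)

# Standardize, keep two components, and use the component scores as inputs
model = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
model.fit(X, y)

scores = make_pipeline(StandardScaler(), PCA(n_components=2)).fit_transform(X)
print(scores.mean(axis=0).round(2))                  # scores are centered at zero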

7.9 Sample Size

Since principal components analysis is a multivariate statistical method, the rule of thumb for sample size (commonly violated) is that there should be from 10 to 25 times as many records as there are numeric fields used in the factor or principal components analysis. This is because principal components and factor analysis are based on correlations, and for p fields there are p*(p–1)/2 correlations. Think of this as a desirable goal and not a formal requirement (technically, if there are p fields there must be at least p+1 observations for factor analysis to run—but don't expect reasonable results). If your sample size is very small relative to the number of input fields, you should turn to principal components.


7.10 Methods

There are several popular methods within the domain of factor and principal components analyses. The common factor methods differ in how they go about fitting the correlation matrix. A traditional method that has been around for many years—for some it means factor analysis—is the principal axis factor method (often abbreviated as PAF). A more modern method that carries some technical advantages is maximum likelihood factor analysis. If the data are ill behaved (say near multicollinear), maximum likelihood, the more refined method, is more prone to give wild solutions. In most cases results using the two methods will be very close, so either is fine under general circumstances. If you suspect there are problems with your data, then principal axis may be a safer bet. The other factor methods are considerably less popular. One factor method, called Q factor analysis, involves transposing the data matrix and then performing a factor analysis on the records instead of the fields. Essentially, correlations are calculated for each pair of records based on the values of the input fields. This technique is related to cluster analysis, but is used infrequently today. Besides the factor methods, principal components can be run and, as mentioned earlier, must be run when the inputs are multicollinear.

Similarly, there are several choices of rotation. The most popular by far is the varimax rotation, which attempts to simplify the interpretation of the factors by maximizing the variances of the input fields' loadings on each factor. In other words, it attempts to find a rotation in which some fields have high and some low loadings on each factor, which makes it easier to understand and name the factors. The quartimax rotation attempts to simplify the interpretation of each field in terms of the factors by finding a rotation yielding high and low loadings across factors for each field. The equimax rotation is a compromise between the varimax and quartimax rotation methods. These three rotations are orthogonal, which means the axes are perpendicular to each other and the factors will be uncorrelated. This is considered a desirable feature since statements can be made about independent factors or aspects of the data. There are nonorthogonal rotations available (axes are not perpendicular); popular ones are oblimin and promax (promax runs faster than oblimin). Such rotations are rarely used in data mining, since the point of data reduction is to obtain relatively independent composite measures, and it is easier to speak of independent effects when the factors are uncorrelated. Finally, principal components does not require a rotation, since there is a unique solution associated with it. However, in practice, a varimax rotation is sometimes done to facilitate the interpretation of the components.
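For reference, a widely published form of the varimax algorithm can be written in a few lines of NumPy. This is a generic textbook implementation, not Clementine's code, so treat it as a sketch; the example loading matrix is invented.

import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a (fields x factors) loading matrix by the varimax criterion."""
    p, k = loadings.shape
    R = np.eye(k)                                  # current rotation matrix
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        B = loadings.T @ (L**3 - L @ np.diag(np.diag(L.T @ L)) / p)
        u, s, vt = np.linalg.svd(B)
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):                  # criterion stopped improving
            break
        d = d_new
    return loadings @ R

# Invented two-factor loadings for four fields
L = np.array([[0.7, 0.5], [0.8, 0.4], [0.5, -0.6], [0.4, -0.7]])
print(varimax(L).round(2))   # after rotation each field loads mainly on one factor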

7.11 Overall Recommendations

For data mining applications, principal components is more commonly performed than factor analysis because of the expected high correlations among the many numeric inputs that are often analyzed, and because there isn't always strong interest in interpreting the results. Varimax rotation is usually done (although it is not necessary for principal components) to simplify the interpretation. If there are not many highly correlated fields (or other sources for ill-behaved data, for example, much missing data), then either principal axis or maximum likelihood factor can be performed. Maximum likelihood has technical advantages, but can produce an ugly solution if the data are not well conditioned (a statistical criterion).


7.12 Example: Regression with Principal Components

To demonstrate principal components, we will run a linear regression analysis predicting an output (amount of waste produced) as a function of several related inputs (amount of acreage put to different uses). After examining the regression results, we will run principal components analysis and use the first few component score fields as inputs to the regression.

Click File…Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on PrincipalComponents.str
Right-click on the Table node connected to the Type node, then click Execute
Examine the data, and then close the Table window
Double-click on the Type node

Figure 7.3 Type Node for Linear Regression Analysis

The INDUST, METALS, TRUCKS, RETAIL, and RESTRNTS fields (which measure the number of acres of a specific type of land usage) will be used as inputs to predict the amount of solid waste (WASTE).

Close the Type node window
Double-click on the Regression node named WASTE at the top of the Stream canvas
Click the Expert tab, and then click the Expert option button
Click the Output button, and then click the Descriptives check box (so it is checked)


Figure 7.4 Requesting Descriptive Statistics in a Linear Regression Node

In anticipation of checking for correlation among the inputs (although we recommend it anyway), we request descriptive statistics (Descriptives). This will display correlations for all the fields in the analysis. (Note that we could have obtained these correlations from the Statistics node.) We can obtain more technical information about correlated predictors by checking the Collinearity Diagnostics check box.

Click OK, and then click the Execute button
Right-click the Regression generated model node named Waste in the Models Manager window, then click Browse
Expand the Analysis topic in the Summary tab

Figure 7.5 Linear Regression Browser Window (Summary Tab)

The estimated regression equation appears in the Summary tab; notice that two of the inputs have negative coefficients.


Click the Advanced tab
Scroll to the Pearson Correlation section of the Correlations table in the Advanced tab of the browser window

Figure 7.6 Correlations for Input and Output Fields

All correlations are positive and there are high correlations between the METALS and TRUCKS fields (.893) and between the RESTRNTS and RETAIL fields (.920). Since some of the inputs are highly correlated, this might create stability problems (large standard errors) for the estimated regression coefficients due to near multicollinearity.

Scroll to the Model Summary table

Figure 7.7 Regression Model Summary

The regression model with five predictors accounted for about 83% (adjusted R Square) of the variation in the output field (waste).

Scroll to the Coefficients table


Figure 7.8 Linear Regression Coefficients

Two of the significant coefficients (INDUST and RETAIL) have negative regression coefficients, although they correlate positively (see Figure 7.6) with the output field. Although there might be a valid reason for this to occur, this coupled with the fact that RETAIL is highly correlated with another predictor is suspicious. Also, those familiar with regression should note that the estimated beta coefficient for RESTRNTS is above 1, which is another sign of near multicollinearity. It is possible that this situation could have been avoided if a stepwise method had been used (this is left as an exercise). However, we will take the position that the current set of inputs is exhibiting signs of near multicollinearity and we will run principal components as an attempt to improve the situation. Before proceeding, let's examine how well this model fits the data.
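If you want a numeric check for this kind of problem outside Clementine, variance inflation factors (VIFs) are a common diagnostic. The snippet below uses statsmodels on simulated stand-in fields (it does not read the actual waste data); the field names only echo the ones in this example.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 40
metals   = rng.normal(50, 10, n)
trucks   = metals * 1.2 + rng.normal(0, 2, n)        # highly correlated pair
retail   = rng.normal(80, 15, n)
restrnts = retail * 0.9 + rng.normal(0, 3, n)        # another correlated pair
indust   = rng.normal(100, 20, n)

X = pd.DataFrame({"INDUST": indust, "METALS": metals, "TRUCKS": trucks,
                  "RETAIL": retail, "RESTRNTS": restrnts})
X.insert(0, "const", 1.0)                            # VIF expects an intercept column

vifs = {col: round(variance_inflation_factor(X.values, i), 1)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)   # values well above 5-10 flag near multicollinearity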

Close the Regression browser window
Double-click the PCA/Factor model node (named Factor) in the stream canvas

Figure 7.9 PCA/Factor Dialog


In Simple mode (see Expert tab), the only options involve selection of the factor extraction method (some of these were discussed in the Methods section). Notice that Principal Components is the default method.

Click the Execute button
Right-click the PCA/Factor generated model node named Factor in the Models Manager window, then click Browse

Figure 7.10 PCA/Factor Browser Window (Five-Component Solution)

Five principal components were found. Since there were originally five input fields, reducing them to five principal components does not constitute data reduction (but it does solve the problem of multicollinearity). If the solution were successful, we would expect that the variation within the five input fields would be concentrated in the first few components and we could check this by examining the Advanced tab of the browser window. However, instead we will use the Expert options to have the PCA/Factor node select an optimal number of principal components.

Close the PCA/Factor browser window
Double-click on the PCA/Factor model node named Factor
Click the Expert tab, and then click the Expert Mode option button


Figure 7.11 Expert Options

The Extract factors option indicates that while in Expert mode, PCA/Factor will select as many factors as there are eigenvalues over 1 (we discussed this rule of thumb earlier in the chapter). You can change this rule or specify a number of factors; this might be done if you prefer more or fewer factors than the eigenvalue rule provides. By default, the analysis will be performed on the correlation matrix; principal components can also be applied to covariance matrices, in which case fields with greater variation will have more weight in the analysis. This is really all we need to proceed, but let's examine the other Expert options.

Notice that the Only use complete records check box becomes active when Expert Mode is selected. By default, PCA/Factor will only use records with complete information on the input fields. If this option is not checked, then a pairwise technique is used: for a record with missing values on one or more fields used in the analysis, the fields with valid values will still contribute. However, the created factor score fields will be set to $null$ for these records. Also, substantial amounts of missing data, when Only use complete records is not selected, can lead to numeric instabilities in the algorithm.

The Sort values check box in the Component/Factor format section will have PCA/Factor list the fields in descending order by their loading coefficients on the factor/component for which they load highest. This makes it very easy to see which fields relate to which factors and is especially useful when many input fields are involved. To further aid this effort, by suppressing loading coefficients less than .3 in absolute value (the Hide values below option) you will only see the larger loadings (small values are replaced with blanks) and not be distracted by small loadings. Although not required, these options make the interpretive task much easier when many fields are involved.


Click the Sort values check box (so it is checked)
Click the Hide values below check box (so it is checked)
Set the Hide values below value to 0.3
Click the Rotation button

Figure 7.12 Expert Options (Factor/Component Rotation)

By default, no rotation is performed, which is often the choice when principal components analysis is run. The Delta and Kappa text boxes control aspects of the Oblimin and Promax rotation methods, respectively.

Click Cancel
Click the Execute button
Right-click the PCA/Factor generated model node, named Factor, in the Models Manager window, then click Browse
Click the Model tab


Figure 7.13 PCA/Factor Browser Window (Two-Component Solution)

The PCA/Factor browser window contains the equations to create component (in this case) or factor score fields from the inputs. Two components were selected based on the eigenvalue greater than 1 rule (recall five were selected in the original analysis under the Simple mode). The coefficients are so small because the components are normalized to have means of 0 and standard deviations of 1, while most inputs have values that extend into the thousands. To interpret the components, we turn to the advanced output.

Click the Advanced tab
Scroll to the Communalities table in the Expert Output browser window


Figure 7.14 Communalities Summary

The communalities represent the proportion of variance in an input field explained by the factors (here principal components). Since initially as many components are fit as there are inputs, the communalities in the first column (Initial) are trivially 1. They are of interest when a solution is reached (Extraction column). Here the communalities are below 1 and measure the percentage of variance in each input field that is accounted for by the selected number of components (two). Any fields having very small communalities (say .2 or below) have little in common with the other inputs, and are neither explained by the components (or factors) nor contribute to their definition. Of the five inputs, all but INDUST have a large proportion of their variance accounted for by the two components, and INDUST itself has a communality of .44 (44%).

Scroll to the Total Variance Explained table in the Advanced tab of the browser window

Figure 7.15 Total Variance Explained (by Components) Table

The Initial eigenvalues area contains all (5) eigenvalues, along with the percentage of variance (of the fields) explained by each and a cumulative percentage of variance. We see in the Extracted Sums of Squared Loadings section that there are two eigenvalues over 1, the first being about twice the size of the second. Two components were selected and they collectively account for about 82 percent of the variance of the 5 inputs. The third eigenvalue is .73, which might be explored as a third component if more input fields were involved (reducing from five fields to three components is not much of a reduction). The remaining two components (fourth and fifth) are quite small. While not pursued here, in practice we might try out a solution with a different number of components.
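For readers who want to see the arithmetic behind these tables, the following numpy sketch reproduces the logic: eigen-decompose the correlation matrix of the inputs, report the percentage of variance for each eigenvalue, apply the eigenvalue-greater-than-1 rule, and compute loadings and communalities. It assumes the df and inputs objects from the earlier sketch and is an illustration only; it will not reproduce Clementine's output formatting.

import numpy as np

R = np.corrcoef(df[inputs].to_numpy(), rowvar=False)   # 5 x 5 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)                    # returned in ascending order
order = np.argsort(eigvals)[::-1]                       # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("eigenvalues:", np.round(eigvals, 3))
print("% of variance:", np.round(100 * eigvals / eigvals.sum(), 1))

k = int((eigvals > 1).sum())                            # eigenvalue-greater-than-1 rule
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])        # component loadings
communalities = (loadings ** 2).sum(axis=1)             # variance of each input explained
for name, h2 in zip(inputs, communalities):
    print(f"{name}: communality = {h2:.2f}")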


Scroll to the Component Matrix table in the Advanced tab of the browser window

Figure 7.16 Component Matrix (Component or Factor Loadings)

PCA/Factor next presents the Component (or Factor) Matrix that contains the unrotated loadings. If a rotation were requested, this table would appear in addition to a table containing the rotated loadings. The input fields form the rows and the components (or factors, if a factor method were run) form the columns. The values in the table are the loadings. If any loading were below .30 (in absolute value), blanks would appear in its position due to our option choice. While it makes no difference here, the option helps focus on the larger (absolute value closer to 1) loadings.

The first component seems to be a general component, having positive loadings on all the input fields (recall that they all correlated positively; see Figure 7.6). In some sense, it could represent the total (weighted) amount of land used in these activities. The second component has both positive and negative coefficients, and seems to represent the difference between land usage for trucking and wholesale trade, fabricated metals, and industrial work, versus retail trade, restaurants and hotels. This might be considered a contrast between manufacturing/industrial and service-oriented use of land. This pattern, all fields with positive loadings on the first component (factor) and contrasting signs on coefficients of the second and later components (factors), is fairly common in unrotated solutions. If we requested a rotation, the fields would group into the two rotated components according to their signs on the second component.

We should note that when interpreting components or factors, the loading magnitude is important; that is, fields with greater loadings (in absolute value) are more closely associated with the components and are more influential when interpreting the components. We know that the two components account for 82 percent of the variation of the original input fields (a substantial amount), and that we can interpret the components. Now we will rerun the linear regression with the components as inputs.
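The Sort values and Hide values below options can be mimicked in a few lines of pandas, again assuming the loadings array and inputs list from the previous sketches; this is only an illustration of what those options do to the displayed table.

import pandas as pd

L = pd.DataFrame(loadings, index=inputs,
                 columns=[f"Component {i + 1}" for i in range(loadings.shape[1])])
L = L.reindex(L.abs().max(axis=1).sort_values(ascending=False).index)  # sort fields by largest loading
print(L.where(L.abs() >= 0.3).round(2).fillna(""))                      # blank loadings below 0.3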

Close the PCA/Factor browser window
Double-click on the Type node located to the right of the PCA/Factor generated model node named Factor


Figure 7.17 Type Node Set Up for Principal Components Regression

The two component score fields ($F-Factor-1, $F-Factor-2) are the only fields that will be used as inputs; the original land usage fields have their direction set to None. If both the land usage fields and the component score fields were inputs to the linear regression, we would have only exacerbated the near multicollinearity problem (as an exercise, explain why).

Close the Type node window
Execute the Regression model node, named Waste, located in the lower right section of the Stream canvas
Right-click the Regression generated model node named Waste in the Models Manager, then click Browse
Click the Summary tab
Expand the Analysis topic


Figure 7.18 Linear Regression (Using Components as Inputs) Browser Window

The prediction equation for waste is now in terms of the two principal component fields. Notice that the coefficient for the second component has a negative sign, which we will consider when examining the expert output.

Click the Advanced tab
Scroll to the Model Summary table

Figure 7.19 Model Summary (Principal Components Regression)

The regression model with two principal component fields as inputs accounts for about 73% of the variance (adjusted R square) in the Waste field. This compares with the 83% in the original analysis (Figure 7.7). Essentially, we are giving up 10% explained variance to gain more stable coefficients and possibly a simpler interpretation. The requirements of the analysis would determine whether this tradeoff is acceptable.

Scroll to the Coefficients table


Figure 7.20 Coefficients Table (Principal Components Regression)

Both components are statistically significant. The positive coefficient for $F-Factor-1 indicates, not surprisingly, that as overall land usage increases, so does the amount of waste. The coefficient for the second component (which represented a contrast of land use for manufacturing/industrial versus service-oriented purposes) is negative, indicating that, controlling for total land usage, as the amount of manufacturing/industrial land use increases relative to service-oriented usage, waste production goes down. Or, to put it another way, as service-oriented land use increases relative to manufacturing/industrial use, waste production increases. As mentioned before, the interpretation of the components, and thus of the regression results, might be made easier by rotating the components (say, using a varimax rotation).

Notice that the components, unlike the original fields (see Figure 7.8), have no beta coefficients above 1, indicating that the potential problem with near multicollinearity has been resolved. It is important to note that while we have shifted from a regression with five inputs to a regression with two components, the five inputs are still required to produce predictions because they are needed to create the component score fields.
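The whole principal components regression can also be sketched outside Clementine. The following hedged example uses scikit-learn (an assumption; any PCA and regression routines would do) with the df and inputs objects from the earlier sketches. Because the component scores here are not rescaled exactly as Clementine's factor scores are, the coefficients will not match Figure 7.20, but the structure of the analysis is the same.

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df[inputs])      # PCA on standardized inputs
pca = PCA(n_components=2)                           # keep two components
scores = pca.fit_transform(X)                       # analogous to the $F-Factor score fields

model = LinearRegression().fit(scores, df["WASTE"])
print("R^2:", model.score(scores, df["WASTE"]))
print("coefficients:", model.coef_, "intercept:", model.intercept_)

# The original inputs are still needed at prediction time: new records must be
# standardized and projected onto the components before applying the regression.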

Additional Readings

Those interested in learning more about factor and principal components analysis might consider the book by Kline (1994), Jae-On Kim's introductory text (1978) and his book with Charles W. Mueller (1979), and Harry Harman's revised text (1979) (see the References section).


Summary Exercises

A Note Concerning Data Files
In this training guide, files are assumed to be located in the c:\Train\ClemPredModel directory.

Exercise Data Information Sheets
The exercises in this chapter are written around the data file waste.dat. The following section gives details of the file.

Waste.dat contains information from a waste management study in which the amount of solid waste produced within an area was related to the type of land usage. Interest is in relating land usage to the amount of waste produced for planning purposes. The inputs were found to be highly correlated, and the data set is used to demonstrate principal components regression. The file contains 40 records and the following fields:

INDUST    Acreage (US) used for industrial work
METALS    Acreage used for fabricated metal
TRUCKS    Acreage used for trucking and wholesale trade
RETAIL    Acreage used for retail trade
RESTRNTS  Acreage used for restaurants and hotels
WASTE     Amount of solid waste produced

1. Working with the current stream from the chapter, request a Varimax rotation of the principal components analysis. Interpret the component coefficients. Use the component score fields from this generated model node as inputs to the Regression node predicting waste. Does the R square change? Explain this. Do the regression coefficients change? How would you interpret them?

2. With the same data, use the Extraction Method drop-down list in the PCA/Factor node to run a factor analysis instead (using principal axis factoring or maximum likelihood) with no rotation. Compare the results to those obtained by the principal components in the chapter. Are they similar? In what way do they differ? Now rerun the factor analysis, requesting a varimax rotation. How do these results compare to those obtained in the first exercise? Do you find anything that leads you to prefer one to the other?



Chapter 8: Time Series Analysis

Objectives
• Explain what is meant by the term time series analysis
• Outline how time series models work
• Demonstrate the main principles behind a time-series forecasting model
• Forecast several series at one time
• Produce forecasts with a time series model on new observations

8.1 Introduction

It is often essential for organizations to plan ahead, and to do this it is important to forecast events so that the transition into the future is a smooth one. To minimize errors when planning for the future, it is necessary to collect information, on a regular basis over time, on any factors that may influence plans. Once a catalogue of past and current information has been collected, patterns can be identified, and these patterns help make forecasts into the future.

Even though many organizations may collect historic information relevant to the planning process, forecasts are often made on an ad-hoc basis. This often leads to large forecasting errors and costly mistakes in the planning process. Statistical techniques provide a more scientific basis upon which to base forecasts. By using these techniques, a more structured approach can be used to ensure careful planning, which will reduce the chance of making costly errors. Statisticians have developed a whole area of statistical techniques, known as time series analysis, which is devoted to forecasting.

Examples

In order to understand how time series analysis works it is useful to give an example. Suppose that a company wishes to forecast the growth of its sales into the future. The benefit of making the forecast is that if the company has an idea of future sales it can plan the production process for its product. In doing so, it can minimize the chances of underproducing and having product shortages or, alternatively, overproducing and having excess stock which will need to be stored at additional cost. Prior to being able to make the forecast, the company will need to collect information on its sales over time in order to gain a full picture of how sales have changed in the past. Once this information has been collected it is possible to plot how sales change over time. An example of this is shown in Figure 8.1. Here information on the sales of a product has been collected each month from January 1982 until December 1995.


Figure 8.1 Plot of Sales Over Time

This is a simple example that demonstrates the idea of time series. Time series analysis looks at changes over time. Any information collected over time is known as a time series; a time series is usually numerical information collected over time on a regular basis. One of the most common uses of time series analysis is to forecast future values of a series, and there are a number of statistical time series techniques which can be used to make forecasts into the future. In the above example the forecast would be the future values of sales. Some time series methods can also be used to identify which factors have been important in affecting the series you wish to forecast: for example, to determine whether an advertising campaign has had a significant and beneficial effect on sales. It is also possible to use time series analysis to quantify the likely impact of a change in advertising expenditure on future sales. Other examples of time series analysis and forecasting include:

• Governments using time series analysis to predict the effects of government policies on inflation, unemployment and economic growth.

• Traffic authorities analyzing the effect on traffic flows following the introduction of parking restrictions in city centers.

• The analyses of how stock market prices change over time. By being able to predict when stock market prices rise or fall decisions can be made about the right times to buy and sell shares.


• Companies predicting the effects of pricing policies or increased advertising expenditure on the sales of their product.

• A company wishing to predict the number of telephone calls at different times during the day, so it can arrange the appropriate level of staffing.

Time series analysis is used in many areas of business, commerce, government and academia, and its value cannot be overstated. A number of time series techniques can be found within the Time Series node in Clementine. This node provides analysts with both a flexible and powerful way to analyze time series data.

8.2 What is a Time Series?

A time series is a variable whose values represent equally spaced observations of a phenomenon over time. Examples of time series include quarterly interest rates, monthly unemployment rates, weekly beer sales, annual sales of cigarettes, and so on. In terms of a data file, time periods constitute the rows (cases) in your file.

Time series analysis is usually based on aggregated data. If we take the monthly sales shown in Figure 8.1, each sale is recorded on a transactional basis with an attached date and/or time stamp. There is usually no business need for sales forecasts on a minute-by-minute basis, while there is often great interest in predicting sales on a weekly, monthly, or quarterly basis. For this reason, individual transactions and events are typically aggregated at equally spaced time points (days, weeks, months, etc.), and forecasting is based on these summaries. Also, most software programs that perform time series analysis, including Clementine, expect each row (case) of data to represent a time period, while the columns contain the series to be forecast. Classic time series involves forecasting future values of a time series based on patterns and trends found in the history of that series (exponential smoothing and simple ARIMA) or on predictor variables measured over time (multivariate ARIMA, or transfer functions).
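As an illustration of this aggregation step (performed outside Clementine), the short pandas sketch below rolls up individual time-stamped transactions into one row per month; the column names and values are invented for the example.

import pandas as pd

transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(["1982-01-03", "1982-01-17", "1982-02-05"]),
    "amount": [120.0, 80.0, 95.0],
})

monthly = (transactions
           .set_index("timestamp")
           .resample("MS")["amount"]   # "MS" = month start; one row per period
           .sum())
print(monthly)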

Time Series Models versus Econometric Models

Time series models are models constructed without drawing on any theories concerning possible relationships between the variables. In univariate models, the movements of a variable are explained solely in terms of its own past and its position with respect to time. ARIMA models are the premier time series models for single series. By way of contrast, econometric models are constructed by drawing on theory to suggest possible relationships between variables. Given that you can specify the form of the relationship, econometrics provides methods for estimating the parameters, testing hypotheses, and producing predictions. Your model might consist of a single equation, which can be estimated by some variant of regression, or a system of simultaneous equations, which can be estimated by two-stage least squares or some other technique.


The Classical Regression Model

The classical linear regression model is the conventional starting point for time series and econometric methods. Peter Kennedy, in A Guide to Econometrics (2nd edition, 1985, MIT Press), provides a convenient statement of the model in terms of five assumptions:

• The dependent variable can be expressed as a linear function of a specific set of independent variables plus a disturbance term (error).

• The expected value of the disturbance term is zero.
• The disturbances have a constant variance and are uncorrelated.
• The observations on the independent variable(s) can be considered fixed in repeated samples.
• The number of observations exceeds the number of independent variables and there are no exact linear relationships between the independent variables.

While regression can serve as a point of departure for both time series and econometric models, it is incumbent on you (the researcher) to generate the plots and statistics which will give some indication of whether the assumptions are being met in a particular context.

Assumption 1 is concerned with the form of the specification of the model. Violations of this assumption include omission of important regressors (predictors), inclusion of irrelevant regressors, models nonlinear in the parameters, and varying coefficient models. When assumption 2 is violated, there is a biased intercept. Assumption 3 assumes constant variance (homoscedasticity) and no autocorrelation. (Autocorrelation is the correlation of a variable with itself at a fixed time lag.) Violations of the assumption are the reverse: non-constant variance (heteroscedasticity) and autocorrelation. Assumption 4 is often called the assumption of fixed or nonstochastic independent variables. Violations of this assumption include errors in measurement in the variables, use of lagged values of the dependent variable as regressors (common in time series analysis), and simultaneous equation models. Assumption 5 has two parts. If the number of observations does not exceed the number of independent variables, then your problem has a necessary singularity and your coefficients are not estimable. If there are exact linear relationships between independent variables, software might protect you from the consequences. If there are near-exact linear relationships between your independent variables, you face the problem of multicollinearity.

In regression, parameters can be estimated by least squares. Least squares methods do not make any assumptions about the distribution of the disturbances. When you make the assumptions of the classical linear regression model and add to them the assumption that the disturbances are normally distributed, the regression estimators are maximum likelihood estimators (ML). It can also be shown that the least-squares methods produce Best Linear Unbiased estimates (BLU). The BLU and ML properties allow estimation of the standard errors of the regression coefficients and the standard error of the estimate, and therefore enable the researcher to do hypothesis testing and calculate confidence intervals.
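As a small, hedged illustration of this kind of assumption checking, the sketch below fits an ordinary least squares model with statsmodels and computes the Durbin-Watson statistic for residual autocorrelation; the data are simulated and the variable names are placeholders, not anything from the course files.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))                  # two illustrative regressors
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.5, size=120)

fit = sm.OLS(y, sm.add_constant(X)).fit()
resid = fit.resid

print(fit.summary())                           # coefficients, R^2, standard errors
print("Durbin-Watson:", durbin_watson(resid))  # values near 2 suggest little autocorrelation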


8.3 A Time Series Data File

To show you what a time series data file looks like, we open a Clementine stream.

Click File…Open Stream and move to the c:\Train\ClemPredModel folder
Double-click on Time Series Intro.str
Execute the Table node

Figure 8.2 A Time Series Data File

Each column in the Table window corresponds to a given variable. The important point to note concerning the organization of time series data is that each row in the Table window corresponds to a particular period of time; each row must therefore represent a sequential time period. The above example shows a data file containing monthly data for sales starting in January 1982. In order to use standard time series methods it is important to collect, or at least be able to summarize, the information over equal time periods. Within a time series data file it is essential that the rows represent equally spaced time periods. Even time periods for which no data were collected must be included as rows in the data file (with missing values for the variables).

The Sequence Chart

The simplest way of identifying patterns in your data is to plot your information over the relevant time period. The data file contains the recorded sales of a product over a fourteen-year period. In Clementine there is a facility to show how time series change over time, known as a sequence chart. The sequence chart plots the value of the variable on the vertical axis at given points in time, with time represented on the horizontal axis. A sequence chart can show several variables on the same chart. Points are joined up to display a line graph which shows any patterns in your data. Plotting a series in this way is essential for time series analysis. In our sales example, the interest might be to see how sales have changed over the fourteen-year period of interest.

Double-click on the Time Plot node to the right of the Time Intervals node to open it
Use the variable selector tool to select Sales

Figure 8.3 Time Plot Dialog

Click Execute

There is an option to Display series in separate panels which can be used to generate a separate chart for each series if you want to plot several of them at once. If you do not check this option, all variables are plotted on one chart. Figure 8.4 shows how sales have changed over the fourteen years.


Figure 8.4 Sequence Plot of Sales

The sequence chart is the most powerful exploratory tool in time series analysis and it can be used to identify trend, seasonal and cyclical patterns in a time series. There is a clear regularity to the time series, and the volume of sales generally increases over time.

8.4 Trend, Seasonal and Cyclic Components

After identifying important patterns that have occurred in the past, time series analysis uses this information to forecast into the future. In Figure 8.4 there are clear patterns in past sales. These patterns can be divided into three main categories: trend, seasonal components and cycles.

Trend Patterns

Trend refers to the smooth upward or downward movement characterizing a time series over a long period of time. This type of movement is particularly reflective of the underlying continuity of fundamental demographic and economic phenomena. Trend is sometimes referred to as secular trend, where the word secular is derived from the Latin word saeculum, meaning a generation or age. Hence, trend movements are thought of as long-term movements, usually requiring 15 or 20 years to describe (or the equivalent for series with more frequent time intervals). Trend movements might be attributable to factors such as population change, technological progress, and large-scale shifts in consumer tastes.

For example, if we could examine a time series on the number of pairs of shoes produced in the United States extending annually, say, from the 1700s until the present, we would find an underlying trend of growth throughout the entire period, despite fluctuations around this general upward movement. If we compared the figures of the recent time against those near the beginning of the series, we would find the recent numbers are much larger. This is because of the increase in population, because of the technical advances in shoe-producing equipment enabling vastly increased levels of production, and because of shifts in consumer tastes and levels of affluence which have meant a larger per capita requirement of shoes than in the earlier time.

In Figure 8.4 there is a clear upward trend in the data, as sales have continued to increase from 1982 until 1995, albeit less pronounced from the beginning of 1991.

Cyclical Patterns

Cyclical patterns (or fluctuations), or business cycle movements, are recurrent up and down movements around the trend levels which have a duration of anywhere from about 2 to 15 years. The duration of these cycles can be measured in terms of their turning points, or in other words, from trough to trough or peak to peak. These cycles are recurrent rather than strictly periodic. The height and length (amplitude and duration) of cyclical fluctuations in industrial series differ from those of agricultural series, and there are differences within these categories and within individual series. Hence, cycles in durable goods activity generally display greater relative fluctuations than consumer goods activity, and a particular time series of, say, consumer goods activity may possess business cycles which have considerable variations in both duration and amplitude.

Economists have produced a large number of explanations of business cycle fluctuations, including external theories which seek the causes outside the economic system, and internal theories in terms of factors within the economic system that lead to self-generating cycles. Since it is clear from the foregoing discussion that there is no single simple explanation of business cycle activity and that there are different types of cycles of varying length and size, it is not surprising that no highly accurate method of forecasting this type of activity has been devised. Indeed, no generally satisfactory mathematical model has been constructed for either describing or forecasting these cycles, and perhaps never will be.

Therefore, it is not surprising to find that classical time series analysis adopts a relatively rough approach to the statistical measurement of the business cycle. The approach is a residual one; that is, after trend and seasonal variations have been eliminated from a time series, by definition, the remainder or residual is treated as being attributable to cyclical and irregular factors. Since the irregular movements are by their very nature erratic and not particularly tractable to statistical analysis, no explicit attempt is usually made to separate them from cyclical movements, or vice versa. However, the cyclical fluctuations are generally large relative to these irregular movements and ordinarily no particular difficulty in description or analysis arises from this source. Therefore, unless you have data available over a long period of time, cyclic patterns are not usually fit by forecasting models.

Seasonal Patterns

Seasonal variations are periodic patterns of movement in a time series. Such variations are considered to be a type of cycle that completes itself within the period of a calendar year, and then continues in a repetition of this basic pattern. The major factors in this seasonal pattern are weather and customs, where the latter term is broadly interpreted to include patterns in social behavior as well as observance of various holidays such as Christmas and Easter. Series of monthly or quarterly data are ordinarily used to examine these seasonal patterns. Hence, regardless of trend or cyclical levels, one can observe in the United States that each year more ice cream is sold during the summer months than during the winter, whereas more fuel oil for home heating purposes is consumed in the winter than during the summer months. Both of these cases illustrate the effect of climatic factors in determining seasonal patterns. Also, department store sales generally reveal a minor peak during the months in which Easter occurs and a larger peak in December, when Christmas occurs, reflecting the shopping customs of consumers associated with these dates.

Seasonal patterns need not be linked to a calendar year. For example, if we studied the daily volume of packages delivered by a private delivery service, the periodic pattern might well repeat weekly (heavier deliveries mid-week, lighter deliveries on the weekend). Here the period for the seasonal pattern could be seven days. Of course, if daily data were collected over several years, then there may well be a yearly pattern as well, and just which time period constitutes a season is no longer clear. The number of time periods that occur during the completion of a seasonal pattern is referred to as the series periodicity. How often the time series data are collected usually depends on the type of seasonality that the analyst expects to find.

• For hourly data, where data are collected once an hour, there is usually one seasonal pattern every twenty-four hours. The periodicity is most likely to be 24.

• For monthly data, where each month a new time period of data is collected, there is usually one seasonal pattern every twelve months. The periodicity is thus likely to be 12.

• For daily data, where data are collected once every day, there is usually one seasonal pattern per week. The periodicity is therefore 7 if the data refer to a seven-day week or 5 if no data are collected on Saturdays and Sundays.

• For quarterly data, where data are collected once every three months, there is usually one seasonal pattern per year. The periodicity is therefore 4.

• For annual data, where data are collected once a year, there is no seasonal pattern. The periodicity is therefore none (undefined).

Of course, changes can occur in seasonal patterns because of changing institutional and other factors. Hence, a change in the date of the annual automobile show can change the seasonal pattern of automobile sales. Similarly, the advent of refrigeration techniques with the corresponding widespread use of home refrigerators has brought about a change of seasonal pattern of ice cream sales. The techniques of measurement of seasonal variation which we will discuss are particularly well suited to the measurement of relatively stable patterns of seasonal variation, but can be adapted to cases of changing seasonal movements as well. In Figure 8.4, there appears to be a rise in sales during the early part of the year while sales tend to fall to a low around November. Finally, there is some recovery in sales leading up to the Christmas period of each year.

Irregular Movements

Irregular movements are fluctuations in time series that are erratic in nature, and follow no regularly recurrent or other discernible pattern. These movements are sometimes referred to as residual variations, since, by definition, they represent what is left over in an economic time series after trend, cyclical, and seasonal elements have been accounted for. These irregular fluctuations result from sporadic, unsystematic occurrences such as wars, earthquakes, accidents, strikes, and the like. In the classical time series model, the elements of trend, cyclical, and seasonal variations are viewed as resulting from systematic influences leading to gradual growth, decline, or recurrent movements. Irregular movements, however, are considered to be so erratic that it would be fruitless to attempt to describe them in terms of a formal model. Irregular movements can result from a large number of causes of widely differing impact.


8.5 What is a Time Series Model?

A time series model is a tool used to predict future values of a series by analyzing the relationship between the values observed in the series and the time of their occurrence. Time series models can be developed using a variety of time series statistical techniques. If there has been any trend and/or seasonal variation present in the data in the past, then time series models can detect this variation, use this information in order to fit the historical data as closely as possible, and in doing so improve the precision of future forecasts. Time series techniques in Clementine can be categorized in the following ways:

Pure time series models: Exponential Smoothing
Causal time series models: Linear Time Series Regression, Intervention Analysis
Both pure and causal: ARIMA

Pure Versus Causal Time Series Models

A distinction can be made between pure and causal time series models.

Pure Time Series Models

Pure time series models utilize information solely from the series itself. In other words, pure time series forecasting makes no attempt to discover the factors affecting the behavior of a series. For example, if the aim were to forecast future sales for a product, then a pure time series model would use just the data collected on sales. Information on other explanatory forces such as advertising expenditure and economic conditions would not be used when developing a pure time series model. In such models it is assumed that some pattern or combination of patterns in the series which is to be forecasted is recurring over time. Identifying and extrapolating that pattern can develop forecasts for subsequent time periods.

The main advantage of pure time series modeling is that it is a quick and simple way of developing a forecast model. Also, such models rely upon little statistical theory. One obvious disadvantage of pure time series models, such as exponential smoothing, is that they cannot identify important factors influencing the series. Another drawback is that it is not possible to accurately predict the impact of any decisions taken by an organization on the future values of the series.

Causal Time Series Models

Causal time series models such as regression and ARIMA will incorporate data on influential factors to help predict future values of a series. In such models, a relationship is modeled between a dependent variable (the time series being predicted), time, and a set of independent variables (other associated factors also measured over time). The first task of forecasting is to find the cause-and-effect relationship. In our sales example, a causal time series technique such as regression would indicate whether advertising expenditure or the price of the product has been an important influence on sales and, if it has, whether each factor has had a positive or negative influence on sales. The real advantage of an explanatory model is that a range of forecasts corresponding to a range of values for the different variables can be developed. For example, causal time series models can assess what effect a $100,000 increase in advertising expenditure will have on future sales, or alternatively a $150,000 increase in advertising expenditure.

The main drawbacks of causal time series models are that they require information on several variables in addition to the variable that is being forecast and usually take longer to develop. Furthermore, the model may require estimation of the future values of the independent factors before the dependent variable can be forecast.

8.6 Interventions

Time series may experience sudden shifts in level, upward or downward, as a result of external events. For example, sales volume may briefly increase as the result of a direct marketing campaign or a discount offering. If sales were limited by a company's capacity to manufacture a product, then bringing a new plant online would shift the sales level upward from that date onward. Similarly, changes in tax laws or pricing may shift the level of a series. The idea here is that some outside intervention resulted in a shift in the level of the series.

In this context, a distinction is made between a pulse (a sudden, temporary shift in the series level) and a step (a sudden, permanent shift in the series level). A bad storm, or a one-time, 30-day rebate offer, might result in a pulse, while a change in legislation or a large competitor's entry into a market could result in a step change to the series.

Time series models are designed to account for gradual, not sudden, change. As a result, they do not natively fit pulse and step effects very well. However, if you can identify events (by date) that you believe are associated with pulse or step effects, they can be incorporated into time series models (they are called intervention effects) and forecasts. Below we see an example of a pulse intervention. In April 1975 a one-time tax rebate occurred in an attempt to stimulate the US economy, then in recession. Note that the savings rate reached its maximum (9.7%) during this quarter. The intervention can be modeled and used in scenarios to assess the effect of a tax rebate on savings rates in the future.


Figure 8.5 U.S. Savings Rate (Seasonally Adjusted): Tax Rebate in April 1975

8.7 Exponential Smoothing

The Expert Modeler in SPSS Trends considers two classes of time series models when searching for the best forecasting model for your data: exponential smoothing and ARIMA. In this section we provide a brief introduction to simple exponential smoothing.

Exponential smoothing is a time series technique that can be a relatively quick way of developing forecasts. This technique is a pure time series method; this means that the technique is suitable when data has only been collected for the series that you wish to forecast. In comparison, ARIMA models can accommodate predictor variables and intervention effects. Exponential smoothing takes the approach that recent observations should have relatively more weight in forecasting than distant observations. "Smoothing" implies predicting an observation by a weighted combination of the previous values. "Exponential" smoothing implies that the weights decrease exponentially as the observations get older. "Simple" (as in simple exponential smoothing) implies that a slowly changing level is all that is being modeled. Exponential smoothing can be extended to model different combinations of trend and seasonality, and it implements many models in this fashion.

An analyst using custom exponential smoothing typically examines the series to make some broad characterizations (is there trend, and if so what type? Is there seasonality [a repeating pattern], and if so what type?) and fits one or more models. The best-fitting model is then extrapolated into the future to make forecasts. One of the main advantages of exponential smoothing is that models can be easily constructed. The type of exponential smoothing model developed will depend upon the seasonal and trend patterns inherent in the series you wish to forecast. An analyst building a model might simply observe the patterns in a sequence chart to decide which type of exponential smoothing model is the most promising one to generate forecasts. In SPSS Trends, when the Expert Modeler examines the series, it considers all appropriate exponential smoothing models when searching for the most promising time series model.

Simple exponential smoothing (no trend, no seasonality) can be described in two algebraically equivalent ways. One common formula, known as the recurrence form, is as follows:

S(t) = α*y(t) + (1 − α)*S(t-1)

Also, the forecast:

y(m) = S(t)

where y(t) is the observed value of the time series in period t, S(t-1) is the smoothed level of the series at time t-1, α (alpha) is the smoothing parameter for the level of the series, S(t) is the smoothed level of the series at time t, computed after y(t) is observed, and y(m) is the model-estimated m-step-ahead forecast at time t.

Intuitively, the formula states that the current smoothed value is obtained by combining information from two sources: the current point and the history embodied in the series. Alpha (α) is a weight ranging between 0 and 1. The closer alpha is to 1, the more exponential smoothing weights the most recent observation and the less it weights the historical pattern of the series. The smoothed value for the current case becomes the forecast value. This is the simplest form of an exponential smoothing model. As mentioned above, and as will be detailed in a later chapter, extensions of the exponential smoothing model can accommodate several types of trend and seasonality, yielding a general model capable of fitting single-series data.
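A minimal Python implementation of this recurrence, shown only to make the formula concrete, appears below. In SPSS Trends the smoothing parameter is estimated for you; here alpha is simply fixed by hand and the sales values are invented.

def simple_exponential_smoothing(y, alpha, s0=None):
    """Return the smoothed level after each observation in y."""
    s = y[0] if s0 is None else s0
    levels = []
    for value in y:
        s = alpha * value + (1 - alpha) * s   # S(t) = alpha*y(t) + (1-alpha)*S(t-1)
        levels.append(s)
    return levels

sales = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
levels = simple_exponential_smoothing(sales, alpha=0.3)
forecast = levels[-1]          # the m-step-ahead forecast is the last smoothed level
print(round(forecast, 1))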

8.8 ARIMA

Many of the ideas that have been incorporated into ARIMA models were developed in the 1970s by George Box and Gwilym Jenkins, and for this reason ARIMA modeling is sometimes called Box-Jenkins modeling. ARIMA stands for AutoRegressive Integrated Moving Average, and the assumption of these models is that the variation accounted for in the series variable can be divided into three components:

• Autoregressive (AR)
• Integrated (I) or Difference
• Moving Average (MA)

An ARIMA model can have any component, or combination of components, at both the nonseasonal and seasonal levels. There are many different types of ARIMA models and the general form of an ARIMA model is ARIMA(p,d,q)(P,D,Q), where:

• p refers to the order of the nonseasonal autoregressive process incorporated into the ARIMA model (and P the order of the seasonal autoregressive process)


• d refers to the order of nonseasonal integration or differencing (and D the order of the seasonal integration or differencing)

• q refers to the order of the nonseasonal moving average process incorporated in the model (and Q the order of the seasonal moving average process).

So for example an ARIMA(2,1,1) would be a nonseasonal ARIMA model where the order of the autoregressive component is 2, the order of integration or differencing is 1, and the order of the moving average component is also 1. ARIMA models need not have all three components. For example, an ARIMA(1,0,0) has an autoregressive component of order 1 but no difference or moving average component. Similarly, an ARIMA(0,0,2) has only a moving average component of order 2.

Autoregressive

In a similar way to regression, ARIMA models use independent variables to predict a dependent variable (the series variable). The name autoregressive implies that the series values from the past are used to predict the current series value. In other words, the autoregressive component of an ARIMA model uses the lagged values of the series variable, that is, values from previous time points, as predictors of the current value of the series variable. For example, it might be the case that a good predictor of current monthly sales is the sales value from the previous month. The order of autoregression refers to the time difference between the series variable and the lagged series variable used as a predictor. If the series variable is influenced by the series variable two time periods back, then this is an autoregressive model of order two and is sometimes called an AR(2) process. An AR(1) component of the ARIMA model is saying that the value of the series variable in the previous period (t-1) is a good indicator and predictor of what the series will be now (at time period t). This pattern continues for higher-order processes. The equation representation of a simple autoregressive model (AR(1)) is:

y(t) = Φ1*y(t-1) + a + e(t)

Thus the series value at the current time point (y(t)) is equal to the sum of: (1) the previous series value (y(t-1)) multiplied by a weight coefficient (Φ1); (2) a constant a (representing the series mean); and (3) an error component at the current time point (e(t)).
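To make the AR(1) equation concrete, the short sketch below simulates such a process and produces a one-step-ahead forecast; phi1, the constant, and the data are all illustrative values, not estimates from any series in this course.

import numpy as np

rng = np.random.default_rng(1)
phi1, a, n = 0.7, 5.0, 200
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi1 * y[t - 1] + a + rng.normal(scale=1.0)   # y(t) = phi1*y(t-1) + a + e(t)

one_step_forecast = phi1 * y[-1] + a     # expected value of the next observation
print(round(one_step_forecast, 2))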

Moving Average

The autoregressive component of an ARIMA model uses lagged values of the series values as predictors. In contrast to this, the moving average component of the model uses lagged values of the model error as predictors. Some analysts interpret moving average components as outside events or shocks to the system. That is, an unpredicted change in the environment occurs, which influences the current value in the series as well as future values. Thus the error component for the current time period relates to the series' values in the future. The order of the moving average component refers to the lag length between the error and the series variable. For example, if the series variable is influenced by the model's error lagged one period, then this is a moving average process of order one and is sometimes called an MA(1) process. An MA(1) model would be expressed as:


y(t) = Φ1*e(t-1) + a + e(t)

Thus the series value at the current time point (y(t)) is equal to the sum of several components: (1) the previous time point's model error (e(t-1)) multiplied by a weight coefficient (here Φ1); (2) a constant (representing the series mean); and (3) an error component at the current time point (e(t)).

Integration

The Integration (or Differencing) component of an ARIMA model provides a means of accounting for trend within a time series model. Creating a differenced series involves subtracting the values of adjacent series values in order to evaluate the remaining component of the model. The trend removed by differencing is later built back into the forecasts by Integration (reversing the differencing operation). Differencing can be applied at the nonseasonal or seasonal level, and successive differencing, although relatively rare, can be applied. The form of a differenced series (nonseasonal) would be:

x(t) = y(t) - y(t-1)

Thus the differenced series value (x(t)) is equal to the current series value (y(t)) minus the previous series value (y(t-1)).
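The differencing operation and its inverse are easy to demonstrate with numpy, as in the sketch below; the series values are invented.

import numpy as np

y = np.array([100.0, 104.0, 109.0, 115.0, 122.0])   # trending series
x = np.diff(y)                                       # x(t) = y(t) - y(t-1)
print(x)                                             # [4. 5. 6. 7.]

# Integration reverses the operation: rebuild y from its first value and x.
rebuilt = np.concatenate(([y[0]], y[0] + np.cumsum(x)))
print(np.allclose(rebuilt, y))                       # True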

Multivariate ARIMA

ARIMA also permits a series to be predicted from values in other data series. The relations may be at the same time point (for example, a company's spending on advertising this month influences the company's sales this month) or in a leading or lagging fashion (for example, the company's spending on advertising two months ago influences the company's sales this month). Multiple predictor series can be included at different time lags. A very simple example of a multivariate ARIMA model appears below:

y(t) = b1*x(t-1) + a + e(t)

Here the series value at the current time point (y(t)) is equal to the sum of several components: (1) the value of the predictor series at the previous time point (x(t-1)) multiplied by a weight coefficient (b1); (2) a constant; and (3) an error component at the current time point (e(t)). In a practical context, this model could represent monthly sales of a new product as a function of direct marketing spending the previous month. Complex ARIMA models that include other predictor series, autoregressive, moving average, and integration components can be built in the Time Series node.
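As a hedged illustration of fitting such a model outside Clementine, the sketch below uses the statsmodels ARIMA class with a one-month-lagged predictor supplied as an exogenous regressor; the series names, lag structure, and simulated data are assumptions for the example, not part of the course data.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
n = 60
marketing = pd.Series(rng.normal(10, 2, n))
sales = 20 + 1.5 * marketing.shift(1) + rng.normal(0, 1, n)   # depends on last month's spend

data = pd.DataFrame({"sales": sales, "marketing_lag1": marketing.shift(1)}).dropna()
fit = ARIMA(data["sales"], exog=data[["marketing_lag1"]], order=(1, 0, 0)).fit()
print(fit.params)                                             # AR(1) and predictor coefficients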

8.9 Data Requirements

In time series analysis, each time period at which the data are collected yields one sample point to the time series. The idea is that the more sample points you have, the clearer the picture of how the series behaves. It is not reasonable to collect just two months' worth of data on the sales of a product and, on the basis of this, expect to be able to forecast two years into the future. This is because your sample size is only two (one sixth of the seasonal span) and you wish to forecast 24 data points, or months, ahead (two full seasonal spans). Therefore the way to view the collection of time series information is that the more data points you have, the greater your understanding of the past will be, and the more information you have to use to predict future values in the series.

The first important question to be answered is how many data points are required before it is possible to develop time series forecasts. Unfortunately, there is no clear-cut answer to this, but the following factors influence the minimum amount of data required:

• Periodicity
• How often the data are collected
• Complexity of the time series model

It is important to note that some time series techniques incorporating seasonal effects require several seasonal spans of time series data before it is possible to use them. Usually four or more seasons of data observations is a good rule of thumb to use when attempting to explore seasonal modeling. For example, four years (seasonal spans) worth of quarterly or monthly data would be sufficient, as there are four replications of the time period. At the same time, four years worth of annual data is not enough historic data, as the sample is only four. The four year rule is not, however, a rigid rule, as time series can be developed and used for forecasting with less historic data. Two final thoughts: first, the more complex the time series model, the larger the time series sample size should be. Secondly, time series models assume that the same patterns appear throughout the series. If you are fitting a long series in which a dramatic change occurred that might influence the fundamental relations that exist over time (for example, deregulation in the airline and telecom industries), you may obtain more accurate prediction using only the recent (after the change) data to develop the forecasts.

8.10 Automatic Forecasting in a Production Setting

Many analysts need to create forecasts for dozens of series on a regular basis. Typical examples are inventory control for many different products/parts, or demand forecasting within segments of customers (geographical, customer type, etc.). In principle, this task is no more complex than what we have already reviewed in the previous chapters. But in practice, it can be demanding simply because of the large number of series which could require data exploration, checking of residuals, etc.

Fortunately, the Expert Modeler will automatically find a best-fitting model for any number of series that are added to the dependent variables list, with little work on your part (you can also use one or more independent variables that would apply to all the outcome series). Although you could, if you had the time, do some preliminary work to determine the characteristics of the series, if you need to make regular forecasts on a weekly or monthly basis, it is likely that you won't have the time to devote to this effort. After models are fit to several series (each series will have its own unique model), you can then easily apply those models in the future, without having to re-estimate or rebuild the models. This will be very time efficient. Of course, when enough time passes, you will most likely want to re-estimate the models, just in case any fundamental processes have changed in the drivers of specific series.

8.11 Forecasting Broadband Usage in Several Markets

Our example of production forecasting involves a national broadband provider who wants to produce forecasts of user subscriptions in order to predict bandwidth usage. To keep the example relatively manageable, we will use only five time series in the example, although there are 85 series altogether. The file broadband_1.sav contains the monthly number of subscribers for each series from January 1999 to December 2003. After fitting models to these series, we want to produce forecasts for the next three months, which will be adequate to prepare for changes in demand/usage. We'll open the data file and do some data exploration.

Click on File…Open Stream
Double-click on the file broadband_1
Execute the Table node

Figure 8.6 Broadband Time Series Data

The file contains information on 85 markets. Rather than looking at all of them, we will focus only on Markets 1 through 5. The Filter node to the right of the source node will filter out the markets we don’t want.

Double-click the Filter node

Figure 8.7 Filter Node Dialog

The next step is to examine sequence charts of each series, but before doing so we need to define the periodicity of each series. This is done in the Time Intervals node, which is found in the Field Ops palette.

Place a Type node to the right of the Filter node
Connect the Filter node to the Type node
Place a Time Intervals node to the right of the Type node
Connect the Type node to the Time Intervals node
Double-click on the Time Intervals node

Figure 8.8 Time Intervals Dialog

The Time Interval dropdown is used to define the periodicity of the series; by default it is set to None. While you are not required to specify a periodicity, unless you do so the Expert Modeler will not consider models that adjust for seasonal patterns. In this case, because the data are collected monthly, we can reasonably expect any seasonal pattern to repeat itself every twelve months, which is one seasonal span. Therefore we will define our time interval as Months.

Click on the Time Interval dropdown and select Months

Figure 8.9 Time Intervals Dialog with Periodicity Defined

The next step is to label the intervals. You can either start labeling from the first record, which in the case of this data file is January 1999, or build the labels from a field that identifies the time or date of each measurement. In order to use the Start labeling from first record method, you must specify the starting date and/or time to label the records. This method assumes that the records are already equally spaced, with a uniform interval between each measurement; any missing measurements would be indicated by empty rows in the data.

You can use the Build from data method for series that are not equally spaced. This method requires a date, time, or timestamp field in the appropriate format, containing the time or date of each measurement, to use as input; Clementine will automatically impute values for any missing time points so that the series has equally spaced intervals. If you have a string field with values like Jan 2000, Feb 2000, etc., you can convert it to a date field using a Filler node. Build from data is the method we are going to use. Before we can do so, however, we must convert the DATE_ field from a string to a date.

Click OK
Insert a Filler node between the Filter node and the Type node

Figure 8.10 Stream After Adding the Filler Node

Double-click on the Filler node
Select DATE_ in the Fill in fields box

Select Always from the Replace: dropdown
Type or use the expression builder to insert to_date(DATE_) in the Replace with: box

Figure 8.11 Completed Filler Node

Click OK

Next, let's set up the Type node so that the direction for all the outcome series we want to forecast is set to Out and the direction for the newly converted DATE_ field is set to None. We will also need to instantiate the data.

Double-click on the Type node
Set the direction on all the fields from Market_1 to Total to Out
Set the direction on the DATE_ field to None
Click the Read Values button to instantiate the data

Figure 8.12 Completed Type Node

Click OK

Now we can complete the Time Intervals settings.

Double-click on the Time Intervals node
Click on Build from data
Use the menu on the Field: option to select DATE_

Figure 8.13 Time Intervals Dialog with Date field added

The New field name extension is used to apply either a Prefix or Suffix to the new fields generated by the node. By default it is $TI_.

Click on the Build tab

Figure 8.14 Build Tab Dialog

The Build tab allows you to specify options for aggregating or padding fields to match the specified interval. These settings apply only when the Build from data option is selected on the Intervals tab. For example, if you have a mix of weekly and monthly data, you could aggregate or “roll up” the weekly values to achieve a uniform monthly interval. Alternatively, you could set the interval to weekly and pad the series by inserting blank values for any weeks that are missing, or by extrapolating missing values using a specified padding function. When you pad or aggregate data, any existing date or timestamp fields are effectively superseded by the generated TimeLabel and TimeIndex fields and are dropped from the output. Typeless fields are also dropped. Fields that measure time as a duration are preserved—such as a field that measures the length of a service call rather than the time the call started—as long as they are stored internally as time fields rather than timestamp.
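To make the aggregate-versus-pad distinction concrete, here is a minimal sketch in Python/pandas (not Clementine itself): a hypothetical weekly subscriber series with one missing week is either "rolled up" to a monthly interval or padded back to a uniform weekly interval. The field name and values are invented for illustration.

    import pandas as pd

    # Invented weekly subscriber counts with one missing week (Jan 19)
    weekly = pd.Series(
        [100.0, 102.0, 105.0, 110.0],
        index=pd.to_datetime(["2003-01-05", "2003-01-12",
                              "2003-01-26", "2003-02-02"]),
        name="subscribers",
    )

    # Aggregate ("roll up") the weekly values to a uniform monthly interval
    monthly = weekly.resample("MS").sum()

    # Or keep a weekly interval and pad the missing week, here by carrying
    # the last observed value forward (one possible padding function)
    padded = weekly.asfreq("W").ffill()

    print(monthly)
    print(padded)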

Click on the Estimation tab

Figure 8.15 Estimation Tab Dialog

The Estimation tab of the Time Intervals node allows you to specify the range of records used in model estimation, as well as any holdouts. These settings may be overridden in downstream modeling nodes as needed, but specifying them here may be more convenient than specifying them for each node individually.

The Begin Estimation option is used to specify when you want the estimation period to begin. You can either begin the estimation period at the beginning of the data or exclude older values that may be of limited use in forecasting. Depending on the data, you may find that shortening the estimation period speeds up performance (and reduces the amount of time spent on data preparation), with no significant loss in forecasting accuracy.

The End Estimation option allows you to either estimate the model using all records up to the end of the data or "hold out" the most recent records in order to evaluate the model. For example, if you hold out the last three records and then specify 3 for the number of records to forecast, you are effectively "forecasting" values that are already known, allowing you to compare observed and predicted values to gauge the model's effectiveness at forecasting into the future. We will use the default settings.
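The holdout idea can be sketched outside Clementine as well. The fragment below is only a rough illustration: the series is invented, and statsmodels' Holt model stands in for whatever the Expert Modeler would actually select. The last three points are held out, the model is estimated on the rest, and the "forecasts" are compared with the known values.

    import numpy as np
    from statsmodels.tsa.holtwinters import Holt

    # Invented, steadily growing series standing in for one market's subscribers
    series = np.array([210, 214, 220, 227, 233, 241,
                       250, 258, 265, 274, 282, 291], dtype=float)

    holdout = 3
    train, held_out = series[:-holdout], series[-holdout:]

    fit = Holt(train).fit()              # estimate on the shortened period
    forecast = fit.forecast(holdout)     # "forecast" values that are already known

    print("held out:", held_out)
    print("forecast:", np.round(forecast, 1))
    print("mean absolute error:", round(float(np.abs(held_out - forecast).mean()), 2))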

Click the Forecast tab

Figure 8.16 Forecast Tab Dialog

The Forecast tab of the Time Intervals node allows you to specify the number of records you want to forecast and to specify future values for use in forecasting by downstream Time Series modeling nodes. These settings may be overridden in downstream modeling nodes as needed, but specifying them here may be more convenient than specifying them for each node individually.

The Extend records into the future option lets you specify the number of time points you wish to forecast beyond the estimation period. Note that these time points may or may not be in the future, depending on whether or not you held out some historic data for validation purposes. For example, if you hold out 6 records and extend 7 records into the future, you are forecasting 6 holdout values and only 1 future value. The Future indicator field is used to label the generated field that indicates whether a record contains forecast data. The default value for the label is $TI_Future.

The Future Values to Use in Forecasting section allows you to specify future values for any predictor fields you use. Future values for any predictor fields are required for each record that you want to forecast, excluding holdouts. For example, if you are forecasting next month's revenues for a hotel based on the number of reservations, you need to specify the number of reservations you actually expect. Note that fields selected here may or may not be used in modeling; to actually use a field as a predictor, it must be selected in a downstream modeling node. This dialog simply gives you a convenient place to specify future values so they can be shared by multiple downstream modeling nodes without specifying them separately in each node. Also note that the list of available fields may be constrained by selections on the Build tab. For example, if Specify fields and functions is selected on the Build tab, any fields not aggregated or padded are dropped from the stream and cannot be used in modeling. The Future value functions option lets you choose from a list of functions, or specify a value of your own; for example, you could set the value to the most recent value. The available functions depend on the type of field.

Click on the Extend records into the future check box
Specify that you would like to forecast 3 records beyond the estimation period

Figure 8.17 Completed Forecast Tab Dialog

Click OK

The next step is to examine each series with a Sequence chart. We will display all the fields on the same chart.

Place a Time Plot node from the Graphs palette below the Time Intervals node
Attach the Time Intervals node to the Time Plot node

Double-click on the Time Plot node
Select all the series from Market_1 to Total
Uncheck the Display Series in separate panels box

Figure 8.18 Completed Time Plot Dialog

Click Execute

Figure 8.19 Sequence Chart Output for Each Series

From this graph, it is clear that broadband usage has been increasing rapidly in the US (even more so in other countries), so we see a steady, very smooth increase for all fields. The numbers for Market_1 do begin to dip in the last couple of months, but perhaps this is temporary. There is clearly no seasonality in these data, which makes sense: the number of broadband subscriptions does not rise and fall throughout the year. If we use this fact, we can reduce the time the Expert Modeler needs to fit models to these series, since requesting that seasonality be considered would increase processing time. Additionally, because the series we've viewed here are so smooth, with no obvious outliers, we'll not request outlier detection, which will also save processing time. Note, though, that if you are in doubt about this, it is safer to use outlier detection during modeling.

Place a Time Series node from the Modeling palette near the Time Intervals node
Connect the Time Intervals node to the Time Series node

Here is the stream so far:

Figure 8.20 Stream with Time Series Node Attached

Double-click on the Time Series node

Figure 8.21 Time Series Node

The default method is Expert Modeler, which automatically selects the best exponential smoothing or ARIMA model for a series or a group of series. As an alternative, you can use the menu to specify a custom Exponential Smoothing or ARIMA model. In addition, there is a Reuse Stored Settings option, which allows you to apply an existing model to new data without re-estimating the model from the beginning. In this way you can save time, producing a new forecast based on the same model settings as before but using more recent data. Thus, if the original model for a particular time series was Holt's linear trend, the same type of model is used for re-estimating and forecasting that series; the system does not reattempt to find the best model type for the new data. We will use the Expert Modeler in this example.

In addition, you can specify the confidence intervals you want for the model predictions and residual autocorrelations. By default, a 95% confidence interval is used. You can also set the maximum number of lags shown in tables and in plots of autocorrelations and partial autocorrelations. You must include a Time Intervals node upstream from the Time Series node; otherwise, the dialog will indicate that no time interval has been defined and the stream will not run. In this example, the settings indicate that the model will be estimated from all the records and that forecasts will be made for 3 time periods beyond the estimation period.

Click the Criteria button

Figure 8.22 Criteria Dialog

The All models option should be checked if you want the Expert Modeler to consider both ARIMA and exponential smoothing models. The other two modeling options can be used if you want the Expert Modeler to only consider Exponential smoothing or ARIMA models. The Expert Modeler will only consider seasonal models if periodicity has been defined for the active dataset. When this option is selected, the Expert Modeler considers both seasonal and nonseasonal models. If this option is not selected, the Expert Modeler only considers nonseasonal models. We will uncheck this option because the sequence charts clearly show that there were no seasonal patterns in broadband subscriptions.

The Events and Interventions option enables you to designate certain fields as event or intervention fields. Doing so identifies a field as containing time series data affected by events (predictable recurring situations, e.g., sales promotions) or interventions (one-time incidents, e.g., a power outage or employee strike). These fields must be of type Flag, Set or Ordered Set, and must be numeric (e.g. 1/0, not T/F, for a Flag field), to be included in this list.

Uncheck the Expert Modeler considers seasonal models option (not shown)
Click the Outliers tab

Figure 8.23 Outliers Dialog

The Detect Outliers automatically option is used to locate and adjust for outliers. Outliers can lead to forecasting bias either up or down, erroneous predictions if the outlier is near the end of the series, and increased standard errors. Because there were no obvious outliers in the sequence chart, we will leave this option unchecked.

Click Cancel
Click Execute
Right-click on the generated model named 6 fields in the Models palette
Click Browse

Figure 8.24 Time Series Model Output (View = Simple)

The Time Series model displays details of the model the Expert Modeler selected for each series. In this case, it chose Holt's exponential smoothing model for the first four series and the last one, and Winters' additive exponential smoothing model for the fifth series. Given the similar patterns in the series, it is not surprising that the same model was chosen for most of them. The default output shows, for each series, the model type, the number of predictors specified, and the goodness-of-fit measure (stationary R-squared is the default). This measure is usually preferable to an ordinary R-squared when there is a trend or seasonal pattern. If you have specified outlier methods, there is a column showing the number of outliers detected. The default output also includes a Ljung-Box Q statistic, which tests for autocorrelation in the errors. Here we see that the result was significant for the Model_2, Model_4, and Total series. Later on, we will examine some residual plots to see why these results were significant.

The default view (Simple) displays the basic set of output columns. For additional goodness-of-fit measures, you can use the View menu to select the Advanced option. The check boxes to the left of each model can be used to choose which models you want to use in scoring; all the boxes are checked by default. The Check all and Un-check all buttons in the upper left act on all the boxes in a single operation. The Sort by option can be used to sort the rows in ascending or descending order of a specified column. As an alternative, you can also click on a column heading itself to change the order.


Click on the View: menu and select Advanced

Figure 8.25 Time Series Model Output (View = Advanced)

The Root Mean Square Error (RMSE) is the square root of the mean squared error. The Mean Absolute Percentage Error (MAPE) is obtained by taking the absolute error for each time period, dividing it by the actual series value, averaging these ratios across all time points, and multiplying by 100. The Mean Absolute Error (MAE) is the average of the absolute values of the errors. The Maximum Absolute Percentage Error (MaxAPE) is the largest absolute forecast error expressed as a percentage. The Maximum Absolute Error (MaxAE) is the largest forecast error in absolute value, whether positive or negative. And finally, the Normalized Bayesian Information Criterion (Norm BIC) is a general measure of the overall fit of a model that attempts to account for model complexity.
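Clementine reports these measures for you, but they are easy to compute from a series of actual and predicted values. The sketch below is a plain Python illustration with invented numbers (it assumes no actual value is zero, since MAPE divides by the actual value).

    import numpy as np

    def fit_measures(actual, predicted):
        actual = np.asarray(actual, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        err = actual - predicted
        pct = 100 * np.abs(err) / np.abs(actual)   # assumes no zero actuals
        return {
            "RMSE":   float(np.sqrt(np.mean(err ** 2))),
            "MAE":    float(np.mean(np.abs(err))),
            "MaxAE":  float(np.max(np.abs(err))),
            "MAPE":   float(np.mean(pct)),
            "MaxAPE": float(np.max(pct)),
        }

    print(fit_measures([100, 110, 125, 140], [98, 113, 122, 145]))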

From this table, you can easily scan the statistics to look for better, or poorer, fitting models. We can see here that Model_5 has the highest Stationary R-squared value (0.544) and Total has a very low one (0.049). However, the Total series has a lower MAPE than any of the other series. The summary statistics at the bottom of the output provide the mean, minimum, maximum and percentile values for the standard fit measures. Here we see that the value for Stationary R-squared at the highest percentile (Percentile 95) is 0.544. This means that Model_5 ranks in the highest percentile based on this statistic, and the Total series in the lowest. Now let's look at the residual plots.

Click on the Residuals tab

Figure 8.26 Residuals Output for the Market_1 Series

The Residuals tab shows the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the residuals (the differences between expected and actual values) for each target field. The ACF values are the correlations between the current value and the previous time points. By default, 24 autocorrelations are displayed. The PACF values look at the correlations after controlling for the series values at the intervening time points. If all of the bars fall within the confidence intervals (the highlighted area), then there are no significant autocorrelations in the series. That seems to be the case with the Market_1 series. However, as we saw in Figure 8.24, the Market_2 series seemed to have significant autocorrelation based on the Ljung-Box Q statistic. Let’s take a look at the residuals plot for the Market_2 series to see if we can see why that statistic was significant for that series.

Use the Display plot for model: option to select the Market_2 series

Figure 8.27 Residuals Output for the Market_2 Series

Here we see that there is significant autocorrelation at lag 6 in both the ACF and PACF plots. Thus, the results of the Ljung-Box Q statistic and these two plots are consistent: there is a non-random pattern in the errors. This does not imply that the current model can't be used for forecasting (it may perform adequately for the broadband company), but it does suggest that the model can be improved. The Expert Modeler is an automatic modeling technique, and it normally finds a fairly acceptable model, but that doesn't mean that some tweaking on your part isn't appropriate.
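For readers who want to reproduce this kind of residual check outside Clementine, the sketch below uses statsmodels on an invented residual series. It only illustrates the ACF/PACF plots and the Ljung-Box Q test discussed above; it is not the output shown in the figures.

    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    from statsmodels.stats.diagnostic import acorr_ljungbox

    rng = np.random.default_rng(1)
    residuals = rng.normal(size=60)      # stand-in for a model's errors

    plot_acf(residuals, lags=24)         # autocorrelation function
    plot_pacf(residuals, lags=24)        # partial autocorrelation function
    plt.show()

    # Ljung-Box Q: a small p-value signals non-random (autocorrelated) errors
    print(acorr_ljungbox(residuals, lags=[18], return_df=True))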

Click OK
Place the generated Time Series model from the Models palette onto the stream canvas
Connect the Time Intervals node to the generated model
Place a Table node near the generated model
Connect the generated model to the Table node
Execute the Table node

Figure 8.28 Table Output Showing Fields Created by Time Series Model

The table now contains a forecast value for each time point, along with an upper and lower confidence limit. In addition, there is a field called $TI_Future that indicates whether a record contains forecast data. For records that extend into the future, the value of this field is "1".
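If the scored table were exported (for example to a flat file), the future indicator could be used to pick out just the forecast records. The fragment below is a hypothetical pandas illustration with made-up values; only the field-naming convention follows the Clementine output described above.

    import pandas as pd

    scored = pd.DataFrame({
        "$TI_TimeLabel": ["Nov 2003", "Dec 2003", "Jan 2004", "Feb 2004", "Mar 2004"],
        "$TS-Market_1":  [52144, 52873, 53590, 54301, 55008],   # invented forecasts
        "$TI_Future":    [0, 0, 1, 1, 1],
    })

    future_only = scored[scored["$TI_Future"] == 1]
    print(future_only)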

Scroll to the bottom of the table and then slightly to the right

Figure 8.29 Table Output with Future Values Displayed

Notice that the original series all have null values on these last three records because they are into the future. On the right hand side in Figure 8.28, we can see the forecast values for future months (January 2004 to March 2004) for the Market_1 series. Finally, let’s create a chart showing the forecast for one of the series.

Close the Table window
Place a Time Plot node near the generated model on the stream canvas
Connect the Time Plot node to the generated model
Select the following fields to be plotted: Market_5, $TS-Market_5, $TSLCI-Market_5, $TSUCI-Market_5
Uncheck the Display Series in separate panels option
Click OK

Figure 8.30 Sequence Chart for Market_5 along with Forecasts and Upper & Lower Confidence Limits

From this chart, it appears that the model fits this series very well.

Click on File…Save Stream As…Broadband.str

8.12 Applying Models to Several Series

We just produced models for 6 series, along with forecasts for the next three months. Suppose that 3 months have passed and we now have actual data for January to March 2004 (for which we made forecasts initially). Now it is April 2004 and we want to make forecasts for the next three months (April to June 2004) using the same models we developed before, without having to re-estimate them now that the file has been updated with three months of new records. We do this with the Reuse stored settings method in the Time Series node to apply the model we just created to the updated data file. (We leave aside whether the correct forecast period is three months, more, or less.)

Click File…Open Stream…Broadband2.str (in the c:\Train\ClemPredModel folder)
Copy the generated Time Series model from Broadband.str (or add it from the Models manager to the stream)
Paste the generated model into Broadband2.str

Figure 8.31 Broadband2.str with the Generated Model from Broadband1.str.

This node contains the settings from the time series models we just created. Normally, with any other Clementine generated model, we would make predictions on new data by attaching this node to the Type node and executing it; this would make predictions for new cases. Time series data, though, are different. Unlike other types of data files, where there is no special order to the cases (in terms of modeling), order makes a difference in a time series. To reuse our settings while also using the new data (from January to March) to make forecasts, we must create a new Time Series modeling node directly from the generated Time Series model.

Right-click on the generated model and select Edit
Click on Generate…Generate Modeling Node

This places a Time Series modeling node onto the stream canvas.

Close the time series modeling output and delete the copied generated model from the stream canvas

Connect the Time Intervals node to the Time Series node

Figure 8.32 Broadband2.str with the Time Series Node Generated from the Previous Model

We don't have to specify any outcome fields because the models, with all their specifications, are already stored in the generated Time Series modeling node. We simply insert the node and decide whether the models should be re-estimated or not. Assuming that you have recently estimated the models, you might be willing to act as if the estimated parameters still hold. You can avoid estimation and apply the models to the new data by using the Reuse Stored Settings method. This choice means that Clementine will use the stored settings for both the model form (type of exponential smoothing or ARIMA model) and the exact parameter estimates (e.g., the value of an AR(1) nonseasonal term). If instead you wish to re-estimate the model parameters, the Expert Modeler choice means that Clementine will use the model form found in the model file but will re-estimate the parameters. Although it will clearly take more computing time to re-estimate the parameters, unless you have many, many very long time series, re-estimating is usually the better choice. However, if you are, say, making forecasts every month (week, etc.) based on just one additional month (week, etc.) of data, it may not be worth the effort to re-estimate every time; in that case, you may wish to re-estimate every few months.
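The distinction between the two choices can be sketched with statsmodels (again, only a rough stand-in for Clementine, using an invented series and an arbitrary ARIMA form): the new observations are appended, and we either keep the stored parameter values or re-estimate them under the same model form.

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    rng = np.random.default_rng(7)
    history = np.cumsum(rng.normal(1.0, 0.3, 48))                   # original 48 months
    new_months = history[-1] + np.cumsum(rng.normal(1.0, 0.3, 3))   # 3 new months

    original_fit = SARIMAX(history, order=(0, 1, 1)).fit(disp=False)

    # "Reuse stored settings": same model form, same parameter estimates
    reused = original_fit.append(new_months, refit=False)

    # Re-estimation: same model form, parameters re-fit on the longer series
    reestimated = original_fit.append(new_months, refit=True)

    print(reused.forecast(3))
    print(reestimated.forecast(3))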

Double-click on the Time Series node
Click the Method dropdown and select Reuse Stored Settings

Figure 8.33 Time Series Model Node with Reuse Stored Settings Selected

Click Execute to place a new model in the Models Manager
Browse the new model

Figure 8.34 Time Series Model Output

As we can see, the models used for each series are the same as before (see Figure 8.24). Now let’s take a look at the new forecasts for April, May and June.

Attach the new model to the Time Intervals node
Attach a Table node to the new Time Series model
Execute the Table node

Figure 8.35 Table Node Output with New Forecasts

In summary, in this chapter we demonstrated how to make forecasts for several series at once, and how to reapply the estimated models to new data at a later date to produce updated forecasts for those same series. The process of applying the models to new data can be repeated as often as necessary.


Summary Exercises

A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ClemPredModel directory.

Exercise Data Information Sheets
The exercises in this chapter are written around the data file broadband_1.sav.

1. Using the same data set that was used in the text (broadband_1.sav), rerun the Time Series node, using different series from the ones used in the chapter to fit a model and then produce forecasts.

2. Try rerunning the models requesting outlier detection. Does this make any difference in the generated models?

3. For those with extra time: Try specifying your own exponential smoothing model(s) to see whether you can obtain a better model than that found by the Expert Modeler.


Chapter 9: Decision List

Overview
• Introduce the Decision List model
• Compare rule induction by Decision List with the decision tree nodes
• Outline the main differences between a decision tree and a decision rule
• Understand how Decision List models a symbolic output
• Review the Interactive Decision List modeling feature
• Use partitioned data to test a model (optional, already covered in an earlier chapter)

Objectives
We introduce the Decision List model, and then describe differences between Decision List and the decision tree algorithms. We then detail the expert options available within the Decision List modeling node. We will also demonstrate the Interactive Decision List feature.

Data
In this chapter we use the data file churn.txt, which contains information on 1477 customers of a telecommunications firm who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers and voluntary leavers. Unlike the models developed in Chapter 3, here we want to understand which factors influence the voluntary leaving of a customer, rather than trying to predict all three categories.

9.1 Introduction

Clementine contains five different algorithms for performing rule induction: C5.0, CHAID, QUEST, C&R Tree (classification and regression trees) and Decision List. The first four are similar in that they all construct a decision tree by recursively splitting data into subgroups defined by the predictor fields as they relate to the outcome. However, they differ in several ways that are important to the user (see Chapter 3). Decision List predicts a symbolic output, but it does not construct a decision tree; instead, it repeatedly applies a decision rules approach. To give you some sense of a Decision List model we begin by browsing such a model and viewing its characteristics. After that we continue by reviewing a table that highlights some distinguishing features of the rule induction algorithms. Finally, we will outline the difference between decision trees and decision rules and the various options for the Decision List algorithm in the context of predicting symbolic outputs.

9.2 A Decision List Model

Before diving into the details of the Decision List node, we review a decision list model.

Click File…Open Stream, and then move to the c:\Train\ClemPredModel directory
Double-click DecisionList.str


Figure 9.1 Decision List Stream

Right-click the Decision List node CHURNED[Vol]
Select Execute

Once the Decision List generated model is in the Models palette, the model can be browsed.

Right-click the Decision List node named CHURNED[Vol] in the Models palette
Click Browse

The results are presented as a list of decision rules, hence Decision List. If you are familiar with the C5.0 model output you will see a distinct likeness to the Rule Set presentation of a C5.0 model.

Figure 9.2 Browsing the Decision List Model

The first row gives information about the training sample. The sample has 719 records (Cover (n)), of which 267 meet the target value Vol (Frequency). Consequently, the percentage of records meeting the target value is 37.13% (Probability). A numbered row represents a model rule and consists of an id, a Segment, a target value or Score (Vol) and a number of measures (here: Cover (n), Frequency and Probability). As you can see, a segment is described by one or more conditions, and each condition in a segment is based on a predictive field, e.g. SEX = F, INTERNATIONAL > 0 in the second segment. All predictions are for the Vol category, as this is what is defined in the Decision List modeling node. The accuracy of predicting this category is listed for each segment in the Probability column, and accuracy is reasonably high for most segments.

As a whole, our model has 5 segments and a Remainder. The maximum number of predictive fields in a segment is 2. No segment is too small (see the measure Cover (n)); the smallest has 52 records. This is not chance: the maximum number of segments in the model, the maximum number of predictive fields in a segment and the minimum number of records in a segment are all set in the Decision List node, as we will see later. We now review the Decision List model in some detail.

The Target
A characteristic of Decision List is that it models a particular value of a symbolic target. In the Decision List model at hand we have modeled the voluntary leaving of a customer as represented by target value CHURNED = Vol.

The Remainder Segment
The Remainder segment is yet another defining characteristic of the Decision List model. Unlike with decision trees, there will be a group of customers for which no prediction is made (the Remainder). The Decision List algorithm is particularly suitable for business situations where you are interested in a relatively small but extremely good (in terms of response) subset of the customer base. Think of customer selection for a marketing campaign where there is a limited campaign budget available: the marketer will only be interested in the top N customers she can afford to approach given her budget, and the rest (the Remainder) will be excluded from the campaign.

Overlapping Segments
In our model the 5 segments and the Remainder form a non-overlapping segmentation of the training sample, meaning that a customer (or a record) belongs to exactly one segment or to the Remainder. So the total of the Cover (n) for all segments including the Remainder should match the Cover (n) of the training sample. This basic requirement affects the way a particular segment should be interpreted when reading the model. The Nth segment should be interpreted as:

The record is in segment N and not(segment N-1) and not(segment N-2) and … and not(segment 1)

Example

Given our model, a female customer with INTERNATIONAL > 0 and AGE from 43 to 58 satisfies both segment 1 and segment 2. However, she will be regarded as a member of segment 1: the rules are applied in the order in which they are listed for the segments, so this customer is assigned to segment 1. A customer belongs to segment 2 if:

not (SEX = F and 42 < AGE <= 58) [the segment 1 conditions]
and SEX = F and International > 0

And a customer belongs to segment 3 if:

not (SEX = F and International > 0) [the segment 2 conditions]
and not (SEX = F and 42 < AGE <= 58) [the segment 1 conditions]
and SEX = F and 73 < AGE <= 89

This mechanism prevents multiple counting of customers in overlapping segments. Be aware that the order of the segments in the model affects the segment a customer belongs to, and therefore also the measures Cover (n), Frequency and Probability for each model segment. This is a consequence of the iterative method by which Decision List generates rules. In a later section we will cover in detail how this rule induction mechanism works. For now it is sufficient to realize that the Decision List algorithm constructs its decision rules using a very different mechanism from the splitting used in the decision tree algorithms. This is why Decision List is a rule algorithm rather than a tree algorithm.
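The first-match-wins logic just described is easy to see in a short sketch. The Python fragment below is illustrative only: the conditions mirror the first three segments quoted above (the actual model has five segments plus the Remainder), and scoring simply returns the first segment whose conditions a record satisfies.

    def assign_segment(record):
        # Ordered list of (segment id, condition); the order matters
        segments = [
            ("1", lambda r: r["SEX"] == "F" and 42 < r["AGE"] <= 58),
            ("2", lambda r: r["SEX"] == "F" and r["INTERNATIONAL"] > 0),
            ("3", lambda r: r["SEX"] == "F" and 73 < r["AGE"] <= 89),
        ]
        for seg_id, condition in segments:
            if condition(record):
                return seg_id      # first match wins; later segments are not checked
        return "Remainder"

    # A female customer aged 50 with INTERNATIONAL > 0 satisfies segments 1 and 2,
    # but is assigned to segment 1
    print(assign_segment({"SEX": "F", "AGE": 50, "INTERNATIONAL": 2}))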

9.3 Comparison of Rule Induction Models

The table below lists some of the important differences between the rule induction algorithms available within Clementine. The first four columns are repeated from Chapter 3 for ease of comparison.

Table 9.1 Some Key Differences Between the Five Rule Induction Models

Each row lists the values in the order C5.0 | CHAID | QUEST | C&R Tree | Decision List.

Split Type for Symbolic Predictors: Multiple | Multiple1 | Binary | Binary | Multiple
Continuous Target: No | Yes | No | Yes | No
Continuous Predictors: Yes | No2 | Yes | Yes | No2
Criterion for Predictor Selection: Information measure | Chi-square (F test for continuous) | Statistical | Impurity (dispersion) measure | Statistical
Can Cases with Missing Predictor Values be Used?: Yes, uses fractionalization | Yes, missing becomes a category | Yes, uses surrogates | Yes, uses surrogates | Yes, missing becomes a category
Priors: No | No | Yes | Yes | No
Pruning Criterion: Upper limit on predicted error | Stops rather than overfits | Cost-complexity pruning | Cost-complexity pruning | Stops rather than overfits
Build Models Interactively: No | Yes | Yes | Yes | Yes
Supports Boosting: Yes | No | No | No | Yes

1SPSS has extended the logic of the CHAID approach to accommodate ordinal and continuous target variables.
2Continuous predictors are binned into ordinal variables containing, by default, approximately equal-sized categories.

Unlike the decision tree algorithms, Decision List does not create subgroups by splitting but by either adding a new predictor or by narrowing the domain of the existing predictor(s) in the group (the decision rule approach); in consequence, tree-splitting issues do not apply here. Decision List can handle targets of type flag and set. Decision List is designed to model a specific category of a symbolic target, so effectively it predicts a binary outcome (target value or not). The algorithm treats continuous predictors by binning them into ordinal fields with approximately equal numbers of records in each category. In generating rules, just like CHAID and QUEST, Decision List uses standard statistical methods, as explained below. The way missing values are handled is set with Expert options: either records with missing values on a predictor are ignored when that predictor is used in forming a subgroup, or, as in CHAID, missing values are used as an additional category in model building. The process of rule generation halts based on settings such as the maximum number of predictors in a rule, explicit group-size settings, and the statistical confidence required.

9.4 Rule Induction Using Decision List

The Decision List modeling node must appear in a stream containing fully instantiated types (either in a Type node or the Types tab in a source node). Within the Type node or Types tab, the field to be predicted (or explained) must have direction OUT or it must be specified in the Fields tab of the modeling node. All fields to be used as predictors must have their direction set to IN (in the Types tab or Type node) or be specified in the Fields tab. Any field not to be used in modeling must have its direction set to NONE. Any field with direction BOTH will be ignored by Decision List. The Decision List node is labeled with the name of the output field and target category.

Like most other models, a Decision List model may be browsed, and predictions can be made by passing new data through it in the Stream Canvas. The target field must have categorical values, and Decision List will model a particular value of the target field. That target value is set in the Decision List node. The other values of the target field will then be regarded as a second category value, appearing as the value $null$ in predictions.

In this example we will attempt to predict which customers voluntarily cancel their mobile phone contract. Rather than rebuild the source and Type nodes, we use the existing stream opened previously. We'll delete the Decision List node so we can review the default settings.

Close the Decision List Browser window
Delete the CHURNED[Vol] node
Place a Decision List node from the Modeling palette to the upper right of the Type node in the Stream Canvas
Connect the Type node to the Decision List node (see Figure 9.3)

The name of the Decision List node should immediately change to No Target Value.

Figure 9.3 Decision List Modeling Node Added to Stream

The reason for the name "No Target Value" is that the target field CHURNED has three values, but Decision List predicts only one specific target value.

Double-click the Decision List node to edit it

Note the message stating that a target value must be specified.

Figure 9.4 Decision List Dialog - Initial

The Model name option allows you to set the name for both the Decision List modeling node and the resulting generated model. The Use partitioned data option is checked so that the Decision List node will make use of the Partition field created by the Partition node earlier in the stream. By default the model is built automatically, as the Mode is set to Generate model; by selecting Launch interactive session it is possible to create the model interactively. The Target value has to be set explicitly to Vol.

Click the button to the right of the Target value
Click Vol, then click Insert

With Decision List you are able to generate rules better than the average or worse than the average, depending on your goal (where the average is the overall probability of the target value). This is set by the Search direction value of Up or Down. An upward search looks for segments with a high frequency of the target value; a downward search creates segments with a low frequency.

A decision rule model contains a number of segments; the maximum is set in Maximum number of segments. Each segment is described by one or more predictors, also known as attributes in the Decision List node. The maximum number of predictive fields to be used in a segment is set in Maximum number of attributes. You may compare this setting with the Levels below root setting in CHAID and QUEST, which prescribes the maximum tree depth. The Maximum number of attributes setting implies a stopping criterion for the algorithm.

Just like the stopping criteria of CHAID, Decision List also has settings related to segment size: As percentage of previous segment (%) and As absolute value (N). The percentage setting states that a segment can only be created if it contains at least a certain percentage of the records of its parent; compare this with a branch point in a tree algorithm. The absolute value setting is straightforward: a segment only qualifies for the model if it is not too small, thus serving the generality requirement of a predictive model. The larger of these two settings takes precedence. Note that whereas in CHAID's stopping criteria you must choose either a percentage or an absolute value approach, Decision List combines the two by using the percentage requirement for the parent and the absolute value requirement for the child.

The model's accuracy is controlled by Confidence interval for new conditions (%). This is a statistical setting, and the most commonly used value is 95, the default. Depending on the business case and how costly an erroneous prediction is, you may increase or decrease this confidence value.

9.5 Understanding the Rules and Determining Accuracy

The predictive accuracy of the rule induction model is not given directly within the Decision List node. Using an Analysis node to obtain that information may be confusing, even misleading, as Decision List only explicitly reports on the particular target value that was modeled; the other value(s) are regarded as $null$. To avoid that, we will use Matrix nodes and Evaluation charts to determine how good the model is. We use a Table node to examine the predictions from the Decision List model.

Click Execute to run the model
Place the generated Decision List node named CHURNED[Vol] from the Models palette in the Manager to the right of the Type node
Connect the Type node to the generated Decision List node
Place a Table node from the Output palette below the generated Decision List node
Connect the generated Decision List node to the Table node
Right-click the Table node, then click Execute and scroll to the right in the table

Figure 9.5 Three New Fields Generated by the Decision List Node

Three new columns appear in the data table, $D-CHURNED, $DP-CHURNED and $DI-CHURNED. The first represents the predicted target value for each record, the second the probability and the third shows the ID of the model segment a record belongs to. The sixth segment is the Remainder. Note that the predicted value is either Vol or $null$, demonstrating that the Decision List algorithm predicts a particular value of the target field to the exclusion of the others.

Click File…Close to close the Table output window

Comparing Predicted to Actual Values
We will use a matrix to see where the predictions were correct, and then we evaluate the model graphically with a gains chart.

Place two Select nodes from the Records palette, one to the upper right of the generated Decision List node and one to the lower right

Connect the generated Decision List node to each Select node

First we will edit the Select node on the upper right that we will use to select the Training sample cases:

Double-click on the Select node on the upper right to edit it
Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the equal sign button

Click the Select from existing field values button and insert the value 1_Training (not shown)


Click OK
Click Annotations tab
Select Custom and enter value Training
Click OK

Figure 9.6 Completed Selection for the Training Partition

Now do the same for the Select node on the lower right to select the Testing sample cases: insert Partition value "2_Testing" and annotate the node as "Testing." Now attach a separate Matrix node to each of the Select nodes. For each of the Select nodes:

Place a Matrix node from the Output palette near the Select node
Connect the Matrix node to the Select node
Double-click the Matrix node to edit it
Put CHURNED in the Rows:
Put $D-CHURNED in the Columns:
Click the Appearance tab
Click the Percentage of row option
Click on the Output tab and custom name the Matrix node for the Training sample as Training and the Testing sample as Testing (this will make it easier to keep track of which output we are looking at)

Click OK

For each actual CHURNED category, the Percentage of row choice will display the percentage of records predicted in each of the outcome categories.

Execute each Matrix node

Figure 9.7 Matrix Output for the Training and Testing Samples

Looking at the Training sample results, the model predicts about 82.0% of the Vol (Voluntary Leavers) category correctly. The results with the testing sample compare favorably (80.5% accurate) which suggests that the model will perform well with new data. Note that technically no prediction for the other two categories is correct, since the model doesn’t predict Current or InVol but just $null$. But we can combine these results by hand to obtain the accuracy. The percentage of correct not Vol predictions is: (313 + 48)/((313 + 68)+(48 +23))*100 = 79.9%. We could have made this calculation easier by creating a two-valued target field based on CHURNED, thus creating a 2 by 2 matrix. Decision List would create the same rules for such a field.
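The hand calculation above amounts to the following few lines of Python (the counts are read from Figure 9.7, on the assumption that 313 Current and 48 InVol records were scored $null$ while 68 and 23 were scored Vol):

    # Accuracy for the "not Vol" rows of the training matrix
    correct_not_vol = 313 + 48
    total_not_vol = (313 + 68) + (48 + 23)
    print(round(100 * correct_not_vol / total_not_vol, 1))   # 79.9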

Close both Matrix windows

To produce a gains chart for the Voluntary group:

Place the Evaluation chart node from the Graphs palette to the right of the generated Decision List node named CHURNED[Vol]

Connect the generated Decision List node to the Evaluation chart node
Double-click the Evaluation chart node, and click the Include best line checkbox

By default, an Evaluation chart will use the first target outcome category to define a hit. To change the target category on which the chart is based, we must specify the condition for a User defined hit in the Options tab of the Evaluation node.

Click the Options tab
Click the User defined hit checkbox
Click the Expression Builder button in the User defined hit group
Click @Functions on the functions category drop-down list
Select @TARGET on the functions list, and click the Insert button
Click the = button
Right-click CHURNED in the Fields list box, then select Field Values


Select Vol, and then click the Insert button

Figure 9.8 Specifying the Hit Condition within the Expression Builder

Click OK

Figure 9.9 Defining the Hit Condition for CHURNED

In the evaluation chart, a hit will now be based on the Voluntary Leaver target category.

Click Execute

Figure 9.10 Gains Chart for the Voluntary Leaving Group

Click Edit…Enable Interaction

The gains line ($D-CHURNED) in the Training data chart rises steeply relative to the baseline, indicating that hits for the Voluntary Leaving category are concentrated in the percentiles predicted most likely to contain this type of customer, according to the model.

Hold the cursor over the model line in the Training partition at the 40th percentile

Approximately 77% of the hits are contained within the first 40 percentiles.

Figure 9.11 Gains Chart for the Voluntary Leaving Group (Interaction Enabled)

The gains line in the chart using Testing data is very similar, which suggests that this model can reliably be used to predict voluntary leavers with new data.

Close the Evaluation chart window

To save this stream for later work:

Click File…Save Stream As
Move to the c:\Train\ClemPredModel directory (if necessary)
Type DecisionList Model in the File name: text box
Click Save

9.6 Understanding the Most Important Factors in Prediction

An advantage of rule induction models, as with decision trees, is that the rule form makes it clear which fields have an impact on the predicted field. There is little need to use alternative methods such as web plots and histograms to understand how the rule is working. Of course, you may still use the techniques described in the previous chapters to help understand the model, but they often are not needed.

In the Decision List algorithm, the most important fields in the predictions can be thought of as those that define the best subgroups in the sample used for training the model at a certain stage in the process. Thus, in this example the most important fields when using the whole training sample are SEX and AGE. Because the sample used for training the model gradually decreases during the stepwise rule discovery process, other predictive fields come to the surface as most important, which intuitively makes sense. So in step 2, when finding the best second segment using the whole training sample except the first segment, the most important fields turn out to be SEX and International. Similarly, when finding segment 3 and using the whole training sample except for the first two segments, SEX and AGE are again the most important predictors. The process continues until the algorithm is unable to construct segments satisfying the requirements, or the stopping criteria are reached.

9.7 Expert Options for Decision List

Now that we have introduced you to the basics of Decision List modeling, we will discuss the Expert options which will allow you to refine your model even further.

Double-click on the Decision List modeling node to edit it

Expert mode options allow you to fine-tune the rule induction process.

Click the Expert tab
Click the Expert Mode option button

Figure 9.12 Decision List Expert Options

Binning
Binning is a method of transforming a numeric field (of type Range) into a number of categories/intervals. The Number of bins input sets the maximum number of bins to be constructed; whether this maximum is actually reached depends on other settings as well. There are two main binning methods, Equal Count and Equal Width. Equal Width transforms a numeric field into a number of fixed-width intervals. Equal Count is a more balanced binning method: it creates intervals based on an equal number of records per interval. The three settings below this control details of the modeling process, described below. If Allow missing values in conditions is checked, the Decision List algorithm will regard being empty or undefined as a particular category that can be used as a condition in a segment. That may result in a segment such as "SEX = F and AGE IS MISSING".
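The two binning methods can be illustrated with pandas (not what Clementine runs internally, just the same idea applied to an invented AGE field): cut gives equal-width intervals, while qcut gives intervals holding roughly equal numbers of records.

    import pandas as pd

    age = pd.Series([18, 22, 25, 31, 34, 40, 47, 53, 58, 64, 71, 80], name="AGE")

    equal_width = pd.cut(age, bins=4)    # four intervals of equal width
    equal_count = pd.qcut(age, q=4)      # four intervals with roughly equal counts

    print(equal_width.value_counts().sort_index())
    print(equal_count.value_counts().sort_index())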

The Decision List Algorithm
The Decision List algorithm constructs lists of rules based on the outcome of a tree. However, the tree is generated quite differently from the way it is done in the decision tree algorithms, so the word "tree" has to be regarded as a way to visualize the solution area and the rule generation process of the Decision List algorithm.

Process Hierarchy
In order to understand the Decision List rule generation process, we must first realize that a decision list contains segments, with each segment containing one or more conditions, and each condition being based on one predictive field. This hierarchy is directly reflected in the rule generation process: a main cycle generates the list's segments, and a sub cycle for each segment constructs the segment's conditions based on the predictive fields. The main cycle is also called the List cycle and the sub cycle is called the Rule cycle. In constructing the conditions at the lowest process level, the algorithm also has a Split cycle, where binning is performed for continuous predictive fields.

Qualification
A key question is: what makes one list better than another, and what makes one segment better than another? For a list, the accuracy is defined by:

List% = 100 * SUM(Frequency) / SUM(Cover (n)), the Remainder excluded

At the segment level, one segment is better than another if:

(1) the probability of the target value on the segment is better (2) there is no overlap between the confidence interval of the segment at hand and the

confidence interval of the other segment. This interval is directly related to the setting for the Confidence interval for new conditions (%), as set in the Simple mode of the Decision List dialog and defined as Probability± Error), where Error is the statistical error in the prediction of the Probability.
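As a hedged illustration of these two criteria, the sketch below (plain Python, with made-up segment counts; the normal-approximation error term is an assumption used only to show the logic, not necessarily the exact formula Clementine applies) computes List% and checks whether two segments' confidence intervals overlap:

import math

segments = [(120, 150), (80, 130)]   # (Frequency, Cover(n)) per segment, Remainder excluded

list_pct = 100.0 * sum(f for f, c in segments) / sum(c for f, c in segments)
print("List% =", round(list_pct, 2))

def interval(freq, cover, z=1.96):              # z = 1.96 corresponds to a 95% confidence interval
    p = freq / cover
    error = z * math.sqrt(p * (1 - p) / cover)  # assumed binomial standard error
    return p - error, p + error

lo_a, hi_a = interval(*segments[0])
lo_b, hi_b = interval(*segments[1])
# Segment A is judged better than segment B only if its probability is higher
# and the two confidence intervals do not overlap.
print("A better than B:", (120 / 150 > 80 / 130) and (lo_a > hi_b))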


List Generation – Simple
To simplify the argument, we will describe the process given the setting Model search width = 1, meaning we will not create multiple lists simultaneously to choose from in the end. So we will assume one List cycle here.

Rule Generation
Given the above, the rule generation process starts with the full Training sample to search for segments. The solution area is generated as follows: on the first rule level, segments are constructed based on one predictive field. The best 5 (Rule search width) are selected as starting points for a second rule level, resulting in a set of segments each described by two predictive fields. Again the best 5 are selected for the third rule level. This continues until the last rule level, which is 5 (Maximum number of attributes), so in principle the fifth-level segments are described by five predictive fields. It is not always possible to refine a given segment in a next step by adding a new predictive field; one of the reasons is the minimum group size as set in As absolute value (N). The algorithm may therefore come up with segments that are described by fewer than five predictive fields. On the other hand, refining a given segment in a next step can also be done not by adding a new predictive field but by reconsidering an existing one; this is controlled by Allow attribute re-use (e.g., “Age between (20, 60)” in level 1 could be refined to “Age between (25, 55)” in level 2). This is why at rule level N there may be segments having fewer than N predictive fields. A segment that is not refined any further is called a final result, which is comparable to a terminal node in a decision tree. If Model search width = 1, then out of all these final results the algorithm returns the best 5 (Maximum number of segments) based on the target value’s probability. Our previous model did create all five. The decision rule process may not be able to use all the “freedom” set by the Rule search width (5) and the Maximum number of attributes (5); the main reasons are typically the group size requirements and/or the statistical confidence requested.
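Structurally, this Rule cycle is a beam search over conditions. The following sketch (plain Python, not Clementine code; the scoring and refinement functions are left abstract, and the bookkeeping of final results is deliberately simplified) shows its general shape:

def rule_cycle(records, fields, score, refine, search_width=5, max_attributes=5):
    # Beam search for segments: keep the best `search_width` candidates at each rule level.
    #   score(segment, records)         -> target probability of the segment (higher is better)
    #   refine(segment, field, records) -> refined candidate segments, or [] when the segment
    #                                      cannot be refined (e.g. the group would be too small)
    beam = [None]        # level 0: the "empty" segment, i.e. the whole remaining sample
    results = []         # candidate segments found along the way (a simplification)
    for _level in range(max_attributes):
        candidates = []
        for segment in beam:
            for field in fields:
                candidates.extend(refine(segment, field, records))
        if not candidates:
            break                                   # nothing could be refined further
        candidates.sort(key=lambda s: score(s, records), reverse=True)
        beam = candidates[:search_width]
        results.extend(beam)
    results.sort(key=lambda s: score(s, records), reverse=True)
    return results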

List Generation – Boosting
Just like C5.0, Decision List has a “boosting” mechanism. This is reflected in the setting Model search width. In describing the Decision List algorithm we assumed Model search width to be 1. By setting a higher value (say 2) you direct the Decision List algorithm to consider 2 alternatives for each segment. Thus the algorithm will deliver the best 2 segments after each Rule cycle. In our model this means that we instructed the algorithm to build a list of 5 segments. The List cycle will now have 5 iterations of the Rule cycle (= Maximum number of segments), and each Rule cycle will have 5 iterations (= Maximum number of attributes). For the first segment on the list, the Rule cycle will return the top 2 segments (= Model search width). Thus, 2 lists are now created, each with 1 segment and a Remainder. On each of the 2 lists, the Rule cycle is performed on the Remainder. This results in 4 lists, each with 2 segments and a Remainder. Out of these 4 lists, the top 2 based on List% are selected to find a third segment, and so forth.


When working in Interactive mode, the Maximum Number of alternatives setting is active. When the model is generated automatically, its value is set to 1. Be aware that the Model search width and the Rule search width have a direct impact on the data-mining processing time.

9.8 Interactive Decision List
Decision lists can be generated automatically, allowing the algorithm to find the best model. As an alternative, you can use the Interactive List Builder to take control of the model building process. You can grow the model segment by segment, you can select specific predictors at a certain intermediate point, and you can ensure that the list of segments is not so complex that it becomes impractical for the business problem. To use the List Builder, we simply specify a decision list model as usual, with the one addition of selecting Launch interactive session in the Decision List node’s Model tab. We’ll use the Decision List interactive session to predict the voluntary leavers.

Close the Decision List modeling node
Click File…Open Stream
Double-click on DecisionList Interactive.str
On the Stream canvas double-click the Decision List node named CHURNED[Vol]
Click the Model tab
Click the Launch interactive session option button

Figure 9.13 Decision List Model Tab with Interactive session enabled


Note that we have modified some of the default settings, such as the maximum number of attributes, the maximum number of segments, and the absolute value of the minimum segment size. Click on the Expert tab to review those settings as well. When the model executes, a generated Decision List model is not added to the Model Manager area. Instead, the Decision List Viewer opens, as shown in Figure 9.14.

Click Execute to open the Decision List Viewer

In the Decision List Viewer click on the Preview button in the bottom right corner to show the Preview pane

Figure 9.14 The Decision List Viewer


The Decision List Viewer workspace provides options for configuring, evaluating, and deploying models. The workspace consists of three panes. The Working model pane displays the current model representation. The Preview pane displays an alternative model or model snapshot to compare to the working model. The Manager pane contains the Session Results tab and the Snapshots tab. The Session Results tab displays mining task results as well as alternative models. The Snapshots tab displays current model snapshots (a snapshot is a model representation at a specific point in time). Note: The generated, read-only model displays only the working model pane and cannot be modified.

In the working model pane you can see two rules. The first gives information about the training sample. Here the sample has 719 records (Cover (n)), of which 267 meet the target value (Frequency). Consequently, the percentage of records meeting the target value is 37.13% (Probability). The second, called Remainder, is now the first segment in our model and contains the whole training sample. This will be the starting point for building our Decision List model.

Right-click the Remainder segment
From the dropdown list select Organize Mining Tasks…

Figure 9.15 Organize Mining Tasks Dialog

The first and only task is the default task, as defined in the Decision List node. You have the choice of executing, deleting, or modifying the task, or creating a new data mining task.

Click Execute to run the default task


Figure 9.16 The Session Results Pane

In the Session Results pane a new entry appears after the task has finished. This entry has one main line and two sub-lines. The main line states that the mining task was performed on the first segment (#1) and is completed. The two sub-lines are the two alternative lists that were generated by this data mining task. Recall that for this task the Model search width is set to 2. The first alternative list (1.1 Alternative 1) contains 7 segments (7#), and the model represented by this list has an average probability of 59.36%. The second alternative list has 8 segments, and the corresponding model has an average probability of 56.13%. Let’s view each of the two alternative lists.

In the Session Results tab click on 1.1 Alternative 1

The result will be displayed in the Preview pane. Notice the Model Summary line at the bottom of the Preview pane.


Figure 9.17 Preview of an Alternative List

In the Session Results tab click on 1.2 Alternative 2 (not shown)

You will see that these two alternatives differ in their 7th segment. The first has a 7th segment based on SEX and the second on AGE. Another interesting segment is the Remainder. The first alternative has a Remainder of 281 and misses 7 voluntarily leaving customers, whereas the second alternative list has a Remainder of 254 and misses 6 of these customers. Assume that we prefer the first alternative but we want to capture some more of the voluntary leavers in the model. First we must promote the first alternative list to our working model, then from there we will continue the model building process.

In the Session Results tab, right-click on 1.1 Alternative 1
From the dropdown list select Promote to Working Model…

The result will be displayed in the Working model pane.


Figure 9.18 Promoting an Alternative to the Working Model

In the Session Results tab click on 1.2 Alternative 2 (not shown)

We can now create a Gains chart for the working model.

Click Gains tab


Figure 9.19 Gains Chart of Working Model

The results look encouraging on both the training data and the testing data. The segments included in the model are represented by the solid line; the excluded portion (Remainder) is represented by the dashed line. Let’s put both the Working model and the Preview model on display in the Gains chart.

Click Chart Options
Click the Preview Model check box
Click OK


Figure 9.20 Gains Chart of Working Model and Preview Model

Although model performance is similar, the Preview model (alternative 2) performs slightly worse than the Working model.

Click the Viewer tab
In the Working model pane, right-click on segment SEX = F


Figure 9.21 Options to Modify a Segment in the Model

Choices in the context menu allow you to modify the segments created by the data mining task. For example, you may decide to delete a segment or to exclude it from scoring. You can even edit the segment. For example, you could add an extra condition to the segment ‘SEX = F’, or you could modify the lower and upper boundary values of EST_INCOME in segment 6 (Edit Segment Rule).

Model Assessment
We have used the Gains chart above to get an overall view of the model. You can also assess the model on a segment level by using the model measures. There are five types of measures available.

From the menu, click Tools…Organize Model Measures


Figure 9.22 Organize Model Measures Dialog

When building a Decision List model, you have five types of measures at your disposal (Display): a Pie Chart and four numerical measures. Each measure has a Type, the Data Selection it operates on (here Training Data), and a setting for whether it will be displayed in the model (Show). The Pie Chart displays the part of the Training sample that is described by a segment. The other coverage measure is Cover (n), which shows the number of records in the Training sample in that segment. The Frequency measure displays the number of records in the segment with the target value, Probability calculates the ratio of Frequency over Cover (n), and Error returns the statistical error. It is possible to add new measures to your model by clicking the Add new model measure button. We’ll create a measure (call it %Test) showing the probability of each segment on the Testing partition. Furthermore, we will rename Probability to %Train.

Click the Add new model measure button

This will create a new row named Measure 6.

Double-click in the Name cell for Measure 6 and change the name to %Test
Click the dropdown list for Type and change to Probability


Figure 9.23 Creating a New Measure

Click the dropdown list for Data Selection and change to Testing Data
Click the Show checkbox for %Test
Double-click in the Name cell for Probability and change its name to %Train, then hit Enter

Figure 9.24 Completed %Test Measure and Renamed Probability Measure

Click OK


Figure 9.25 New Measures Added to the Working Model

Decision List Viewer can be integrated with Microsoft Excel, allowing you to use your own value calculations and profit formulas directly within the model building process to simulate cost/benefit scenarios. The link with Excel allows you to export data to Excel, where it can be used to create presentation charts, calculate custom measures, such as complex profit and ROI measures, and view them in Decision List Viewer while building the model. The following steps are valid only when MS Excel is installed. If Excel is not installed, the options for synchronizing models with Excel are not displayed. Suppose that we have created a template in Excel where, based on the Probability and on the Coverage of a segment, we calculate the amount of loss we will suffer should the customers in a segment actually leave voluntarily.

Click Tools and select Organize Model Measures
Click Yes for Calculate custom measures in Excel (TM)
Click the Connect to Excel (TM)… button
Browse to C:\Train\ClemPredModel\ and select Template_churn_loss.xlt
Click Open


Figure 9.26 The Excel Workbook for the Churn Case

Switch to Clementine using the ALT-Tab keys on your keyboard

Figure 9.27 Excel Input Fields


The Choose Inputs for Custom Measures window reveals that Excel expects two fields for input: Probability and Cover. In return, four fields are available to add to your model:

Loss = Probability * Cover * Loss – Cover * Variable Cost
%Loss = 100 * Loss / Sum (Loss), the fraction of the total loss accounted for by a segment
Cumulative = Cumulative Loss
%Cumulative = % Cumulative Loss

By default all are selected. Clicking on an empty Model Measure cell in the dialog will open a dropdown list with all the measures available in your model.
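As a rough illustration of what such an Excel template computes, here is a sketch in plain Python (the per-customer loss and variable-cost figures, and the segment statistics, are assumptions for this example, not values from the course template):

segments = [(0.80, 150), (0.62, 130), (0.55, 200)]   # (Probability, Cover) per segment, made up
LOSS_PER_LEAVER = 500.0    # assumed loss per customer who leaves voluntarily
VARIABLE_COST = 5.0        # assumed handling cost per covered customer

losses = [p * c * LOSS_PER_LEAVER - c * VARIABLE_COST for p, c in segments]
total = sum(losses)
cumulative = 0.0
for i, loss in enumerate(losses, start=1):
    cumulative += loss
    print(f"Segment {i}: Loss={loss:,.0f}  %Loss={100 * loss / total:.1f}  "
          f"Cumulative={cumulative:,.0f}  %Cumulative={100 * cumulative / total:.1f}")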

Click in the Model Measure cell for Probability and select %Train
Click in the Model Measure cell for Cover and select Cover (n)

Figure 9.28 Mapping Excel Input File to the Decision List Model Measure

Click OK

In the Organize Model Measures window you will see which measures are available for input to your model. By default all are selected.

Deselect measure %Test (not shown)
Click OK


Figure 9.29 The Decision List Model with External Measures

As you can see, segment 4 is responsible for more than 20% of the total loss expected (reflected by its measure %Loss), and the first four segments for more than 50% (reflected by the %Cumulative measure for the fourth segment). So if the business objective was to select a set of customers in a retention campaign to reduce the expected loss by at least 50%, the list manager would probably choose the first 4 segments to be scored. If you wish to exclude a segment from a model, it can be done from a context menu.

Right-click on Segment 5


Figure 9.30 Manually Excluding Segments from Scoring Based on External Measures

Interactive Decision Lists are not a model, but instead are a form of output, like a table or graph. When you are satisfied with the list you have built, you can generate a model to be used in the stream to make predictions.

Click Generate…Generate Model
Click OK in the resulting dialog box (not shown)
Close the Interactive Decision List Viewer window

A generated Decision List model appears in the upper left corner of the Stream Canvas. It can be edited, attached to other nodes, and used like any other generated model. The only difference is in how it was created.


Summary Exercises

A Note Concerning Data Files
In this training guide, files are assumed to be located in the c:\Train\ClemPredModel directory.

Exercise Data Information Sheets
The exercises in this chapter are written around the data file Newschan.sav.

1. Begin with a clear Stream canvas. Place an SPSS File source node on the canvas and connect it to Newschan.sav.

2. Try to predict whether or not someone responds to a cable news service offer (NEWSCHAN). Start by using the default settings. How many segments were created? What fields were used? Does this model seem adequate?

Try different models by changing various settings, including the minimum segment size, allowing attribute reuse, confidence interval (change to 90%), or some of the expert settings. Can you find a better model?


Chapter 10: Finding the Best Model for Binary Outcomes

Objectives
• Introduce the Binary Classifier Node
• Use the Binary Classifier Node to predict customers who will churn

10.1 Introduction When you are creating a model, it isn’t possible to know in advance which modeling technique will produce the most accurate result. Often several different models may be appropriate for a given data file, and normally it is best to try more than one of them. For example, suppose you are trying to predict a binary outcome (buy/not buy). Potentially, you could model the data with a Neural Net, any of the Decision Tree algorithms, Logistic Regression, or Decision List. In certain situations, you may also be able to use Discriminant Analysis. Unfortunately this process can be quite time consuming. The Binary Classifier node allows you to create and compare models for binary outcomes using a number of methods all at the same time, and compare the results. You can select the modeling algorithms that you want to use and the specific options for each. You can also specify multiple variants for each model. For instance, rather than choose between the quick, dynamic, or prune method for a Neural Net, you can try them all. The node generates a set of models based on the specified options and ranks the candidates based on the criteria you specify. The supported algorithms include Neural Net, Decision Trees (C5.0, C&RT, QUEST, and CHAID), Logistic Regression and Decision List. To use this node, a single target field of type Flag and at least one predictor field are required. We will continue to use the Churn.txt file which we used in earlier chapters. However, we will have to combine the Voluntary and Involuntary Leavers into a single category in order to use this node. The predictor fields can be numeric ranges or categorical, although any categorical predictors must have numeric storage (not string). If necessary, you can use the Reclassify node to convert them.

Click File…Open Stream, and then move to the c:\train\ClemPredModel folder
Double-click on FindBestModel.str
Place a Binary Classifier node from the Modeling palette to the right of the Type node
Connect the Type node to the Binary Classifier node
Edit the Derive node named LOYAL


Figure 10.1 Creation of Flag Field Identifying Loyal Customers

In the Derive node we use the field CHURNED to create a new target with the name LOYAL. This target will be a flag, with a value of Leave when CHURNED is not equal to Current; this means that customers who are voluntary or involuntary leavers will have values of Leave. Current customers, who will stay, will have a value of Stay.
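A minimal sketch of the same flag logic in plain Python (the category labels shown are illustrative; the actual Derive node uses a CLEM condition on CHURNED):

def loyal_flag(churned):
    # Collapse voluntary and involuntary leavers into a single Leave category.
    return "Stay" if churned == "Current" else "Leave"

print(loyal_flag("Current"))   # -> Stay
print(loyal_flag("Vol"))       # -> Leave (voluntary leaver)
print(loyal_flag("InVol"))     # -> Leave (involuntary leaver)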

Close the Derive node
Edit the Binary Classifier node


Figure 10.2 Binary Classifier Node

The maximum number of models listed in the Binary Classifier summary report is 20 by default, but you can increase or decrease this value. The Rank models by option allows you to specify the criteria used to rank the models. Note that the True value defined for the target field is assumed to represent a hit when calculating profits, lift, and other statistics. We have defined Leave as the True category in the Derive node because we are more interested in locating persons who will leave the company than those who will stay. Models can be ranked on either the Training or Testing data, if a Partition node is used.

Click on the Rank models by menu to see the different ranking options


Figure 10.3 Ranking Options Within the Binary Classifier

Overall accuracy refers to the percentage of records that are correctly predicted by the model, relative to the total number of records.

Area under the curve (ROC curve) provides an index for the performance of a model. The further the curve is above the reference line, the more accurate the model.

Profit (Cumulative) is the sum of profits across cumulative percentiles (sorted in terms of confidence for the prediction), based on the specified cost, revenue, and weight criteria. Typically, the profit starts near 0 for the top percentile, increases steadily, and then decreases. For a good model, profits will show a well-defined peak, which is reported along with the percentile where it occurs. For a model that provides no information, the profit curve will be relatively straight and may be increasing, decreasing, or level, depending on the cost/revenue structure that applies.

Lift (Cumulative) refers to the ratio of hits in cumulative quantiles relative to the overall sample (where quantiles are sorted in terms of confidence for the prediction). For example, a lift value of 3 for the top quantile indicates a hit rate three times as high as for the sample overall. For a good model, lift should start well above 1.0 for the top quantiles and then drop off sharply toward 1.0 for the lower quantiles. For a model that provides no information, the lift will hover around 1.0.

Number of variables ranks models based on the number of variables used.

The Profit Criteria section is used to define the cost, revenue, and weight values for each record. Profit equals the revenue minus the cost for each record. Profits for a quantile are simply the sum of profits for all records in the quantile. Profits are assumed to apply only to hits, but costs apply to all records. Use the Costs option to specify the cost associated with each record. You can specify either a Fixed or a Variable cost. Use the Fixed costs option if the costs are the same for each record. If the costs are variable, select the field that holds the cost associated with each record. The Revenue option is used to specify the amount of revenue associated with each record. Again, this value can be either Fixed or Variable. The Weight option should be used if your data represent more than one unit. This option allows you to use frequency weights to adjust the results. For fixed weights, you will need to specify the weight value (the number of units per record). For variable weights, use the Field Selector button to select a field as the weight field. Note that model profit will have nothing to do with monetary profit unless you specify actual cost and revenue values. Nevertheless, the defaults will still give you some sense of how good the model is compared to other models. For example, if it costs you 5 dollars to send out a promotion, and you get 10 dollars in revenue for each positive response, the model with the highest cumulative profit would be the one with the most hits.

Lift Criteria is used to specify the percentile used for lift calculations. The default is 30.
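To make the cumulative lift and profit definitions concrete, here is a small sketch in plain Python (the scores, outcomes, and the 5-dollar cost / 10-dollar revenue figures are made-up assumptions mirroring the example above):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
confidence = rng.random(n)                          # model confidence for the "hit" category
hit = rng.random(n) < (0.2 + 0.6 * confidence)      # synthetic outcomes: hits likelier at high confidence

order = np.argsort(-confidence)                     # rank records by descending confidence
top = order[: int(0.30 * n)]                        # top 30 percentile, the default Lift Criteria

print("Cumulative lift (top 30%):", round(hit[top].mean() / hit.mean(), 2))

COST, REVENUE = 5.0, 10.0                           # cost per record, revenue per hit
print("Cumulative profit (top 30%):", REVENUE * hit[top].sum() - COST * len(top))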

Select Lift from the Rank models by: menu

Figure 10.4 Completed Binary Classifier Model Tab

Click the Expert tab


Figure 10.5 Binary Classifier Expert Tab

The Expert tab allows you to select from the available model types and to specify stopping rules. By default, each of the seven model types is checked and will be used. However, it is important to note that the more models you select, the longer the processing time will be. You can uncheck a box if you don’t want to consider a particular algorithm. The Model parameters option can be used to change the default settings for each algorithm, or to request different versions of the same model. For example, you could request all six of the Neural Net training methods in a single pass of the data. In this example, we will look at how an additional Neural Net model (such as the Dynamic method) could be requested, and we will take the default values for all the other models.

Click on the Model Parameters cell for Neural Net and select Specify


Figure 10.6 Algorithm Settings for Neural Net Models

Click in the Method row in the Options cell and select Specify


Figure 10.7 Neural Net Parameter Editor

At this point, we could check additional Neural Net algorithms. However, in the interest of time, we will stick with just the Quick method.

Click Cancel

Before we move on, note that the Set random seed parameter is set to false. This means that the random seed for the neural net model(s) will be generated each time the Binary Classifier node is executed, and this will result in a somewhat different model each time (for each type of neural net requested). If you wish, you can set the Set random seed parameter to true, and then specify a seed with the Seed parameter. In the class we will not do so, to make the example more realistic, so expect your results to differ from those listed below or from the instructor’s.

Click OK again to return to the main dialog
Click on the Stopping rules button


Figure 10.8 Stopping Rules Dialog

Stopping rules can be set to restrict the overall execution time to a specific number of hours. All models generated to that point will be included in the results, but no additional models will be produced. In addition, you can request that execution be stopped once a model has been built that meets all the criteria specified in the Discard tab (see Figure 10.9).

Click Cancel
Click the Discard tab

Figure 10.9 Binary Classifier Discard Tab

The Discard tab allows you to automatically discard models that do not meet certain criteria. These models will not be listed in the summary report. You can specify a minimum threshold for overall accuracy, lift, profit, and area under the curve, and a maximum threshold for the number of variables used in the model. Optionally, you can use this dialog in conjunction with Stopping rules to stop execution the first time a model is generated that meets all the specified criteria. In this example, we will not set any discard criteria.


Click Execute

Figure 10.10 Binary Classifier Results

Here we see that the CHAID model is the best based on the Lift statistic. The number to the right of the model type indicates the number of variations you requested with that algorithm; note that there is a “1” to the right of each model. Use the Sort by: option, or click on a column header, to change the column used to sort the table. In addition, you can use the Show/hide columns toolbar tool to show or hide specific columns, and another toolbar button to change the cumulative lift percentile. The default value is 30. If a partition is in use, you can choose to view results for the training or testing partition as applicable. Because we did not use a Partition node, we are displaying the results from the Training set. The next step is to generate one or more of the models listed in the Binary Classifier Report browser. Each generated model can be used as is without having to re-execute the stream. Alternatively, you can generate a modeling node which you can add to your stream.

Check the Generate: box for CHAID
Click Generate…Model(s) to Palette


Figure 10.11 Binary Classifier Generate Options

Close the Binary Classifier browser
Move the generated model to the Stream canvas
Connect the CHAID model to the Type node
Place a Matrix node to the right of the CHAID model
Connect the CHAID model to the Matrix node

Figure 10.12 Revised Stream with the Addition of the CHAID Model and a Matrix Node

Double-click on the Matrix node
Put LOYAL in the Rows:
Put $R-LOYAL in the Columns:
Click the Appearance tab
Click the Percentage of row option
Execute the Matrix node


Figure 10.13 Matrix Node Output

Here we see that the CHAID model correctly identified just over 90% of the Leavers. And while it didn’t predict current customers with the same degree of accuracy, the 79.7% figure would in all likelihood be very acceptable. In this way, we were readily able to run several models at one time, compare their results, and choose a model to examine further, or use to make predictions on future data.


Summary Exercises

A Note Concerning Data Files
In this training guide, files are assumed to be located in the c:\Train\ClemPredModel directory.

Exercise Data Information Sheets
The exercises in this chapter are written around the data file charity.sav. The following section gives details of the file.

charity.sav comes from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity, and basic demographics such as age, gender and mosaic (demographic) group. The file contains the following fields:

response    Response to campaign
orispend    Pre-campaign expenditure
orivisit    Pre-campaign visits
spendb      Pre-campaign spend category
visitb      Pre-campaign visits category
promspd     Post-campaign expenditure
promvis     Post-campaign visits
promspdb    Post-campaign spend category
promvisb    Post-campaign visit category
totvisit    Total number of visits
totspend    Total spend
forpcode    Post Code
mos         52 Mosaic Groups
mosgroup    Mosaic Bands
title       Title
sex         Gender
yob         Year of Birth
age         Age
ageband     Age Category

1. Begin with a clear Stream canvas. Place an SPSS File source node on the canvas and connect it to charity.sav.

2. Try to predict Response to campaign using all the available model choices. Use the defaults first, and use the same inputs as you did in Chapter 2. Which model is best, and which is worst? You can choose the criterion for ranking models, or look at more than one. Which model uses the fewest inputs?

3. Now change some of the model settings on one or more models and rerun the Binary Classifier. Does the order of the models change?


4. Pick two or more models and generate a model for each. Add them to the stream and use an Analysis node to further compare their predictions.


Chapter 11: Getting the Most from Models

Objectives
• Discuss common approaches to improving the performance of a model in data mining projects
• Using confidence to improve models
• Meta-modeling (combining models)
• Modeling of errors

11.1 Introduction Throughout this course we have looked at several different modeling techniques, including neural networks, decision trees and rule induction, regression and logistic regression, and discriminant analysis. After building a model we have usually performed some form of diagnostic analysis that helps with the interpretation of the model, and we have also done additional analyses to help determine where the model is more and less accurate. In this chapter we develop and extend the model building skills learned so far. The key concept in these examples is that models built with an algorithm in Clementine should usually (unless accuracy is very high and satisfactory) be viewed not as the endpoint of an analysis, but as a way station on the path to a robust solution. There are various methods to improve models, only some of which we discuss here, and you are likely to come up with your own as you become experienced using Clementine. We provide methods for how to improve a model, but there is no one simple answer as to how this should be done. That is because the appropriate method is highly dependent upon characteristics of the existing model that has been built. Potential things to consider when improving the performance of a model are:

• The modeling technique used
• The data type of the output field (symbolic or numeric)
• Which parts of the model are under-performing, i.e., are less accurate
• The distribution of confidence values for the existing model

11.2 Modifying Confidence Values for Scoring
Confidence values obtained from a model in Clementine reflect the level of confidence that the model has in the prediction of a given output, and they are only available for symbolic outputs, either flags or sets. Confidence values make no distinction between categories of an output variable; thus, for a flag with values of “yes” and “no,” confidence values can vary from 0 to 1 for predictions in each category. So for the churn data, a high degree of confidence does not help us determine whether that customer will stay or leave the company (it instead indicates the confidence that we have in the prediction of either stay or leave).


But sometimes it would be helpful to modify the confidence so that, for the category of interest, a high confidence value means a prediction of leave, and a low confidence value indicates stay. Such a field is a type of score that can be used in choosing cases for future actions: intervention, marketing efforts, and so forth. For the examples in this chapter, we will use the churn data from a previous example. A Derive node has been added to the beginning of the stream to create a modified version of the CHURNED field. We convert CHURNED into the field LOYAL, which measures whether or not a customer remained with the company. LOYAL groups together both voluntary and involuntary leavers into one group, so comparisons can be made with customers who remain loyal. We begin by opening the corresponding stream.

Click File…Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on Confidence.str

Both a neural net and a C5.0 model were trained to predict the field LOYAL, using the ChurnTrain.txt data. Their generated models were then added to the stream, connected to the Type node. Two Derive nodes were then added, one for each model, and given names corresponding to the score values they calculate based on the model predictions and confidence values (C5_SCORE and NN_SCORE). Then a Histogram node displays each score with LOYAL as an overlay.

Figure 11.1 Stream Calculating Scores from Predictions and Confidence Values

Let’s look at the Derive node, which calculates the C5_SCORE field.

Edit the Derive node C5_SCORE

The new field C5_SCORE is created using two formulas. When the prediction of the C5.0 model is that a customer will stay, the confidence score is calculated as:


0.5 – (model confidence/2)

When the prediction is that a customer will leave (the category of most interest) the score is:

0.5 + (model confidence/2)

Since the model confidence varies from .50 to 1 for a C5.0 model (it can’t be below .50 because the modal category is used to make a prediction, and there are only two categories), the first equation will create scores ranging from 0 to .25, and the greater the confidence that a customer will stay, the closer the score will be to zero (0.5 – 1.0/2). Conversely, the second equation creates scores ranging from .75 to 1.00, and the greater the confidence that the customer will leave, the closer the score will be to 1 (0.5 + 1.0/2).

Figure 11.2 Derive Node Transforming Predictions and Confidence Values into Scores
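The Derive logic amounts to a simple conditional. A minimal sketch in plain Python (the prediction labels mirror the Stay/Leave flag described above; your field and category names may differ):

def c5_score(prediction, confidence):
    # Map a C5.0 prediction/confidence pair onto a single 0-1 propensity-to-leave score.
    if prediction == "Stay":
        return 0.5 - confidence / 2    # 0 to .25 as confidence runs from 1 down to .5
    return 0.5 + confidence / 2        # .75 to 1 as confidence runs from .5 up to 1

print(c5_score("Stay", 0.92))    # close to 0: confident the customer stays
print(c5_score("Leave", 0.92))   # close to 1: confident the customer leaves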

The Derive node for NN_SCORE is similar. However, since confidence varies from 0 to 1 for a neural network model, the resulting modified confidence will vary from 0 to .50 for customers predicted to stay, and from .50 to 1.0 for those predicted to leave, i.e., across the full range from 0 to 1. Before examining the distribution of these new fields, let’s review the values of $CC-LOYAL, the actual confidence values for the C5.0 model, overlaid by $C-LOYAL, the predicted value.

Close the Derive node
Add a Histogram node to the Stream canvas, and connect the C5_SCORE Derive node to it


Edit the Histogram node, select $CC-LOYAL as the Field to display and $C-LOYAL as the Color Overlay Field (not shown)

Execute the Histogram node

We can see in the figure below that the confidence scores range from .50 to 1.0, but that a high confidence doesn’t necessarily indicate that we expect a customer to leave or stay, since there are customers in both categories at high confidence values (we would find the same pattern if we used the values of LOYAL, the actual status of customers). For this model, very high confidence is associated with the Stay category, but this sort of pattern would not be found in general.

Figure 11.3 Distribution of Original Confidence Value by Predicted Loyalty

Now we can create the histogram with the modified confidence scores. We’ll look at the score field for each model in turn.

Close the Histogram window
Execute the Histogram node named C5_SCORE

The distribution of the new C5_SCORE field is bimodal, with scores either near 0 or near 1.0. Those predicted to leave all have scores above .75 (actually above about .90), and those predicted to stay have scores below .25 (actually close to 0).


Figure 11.4 C5_SCORE Distribution Overlaid by Predicted Loyalty

Next we’ll look at the field NN_SCORE.

Close the Histogram plot window
Execute the Histogram node named NN_SCORE

The distribution of NN_SCORE is continuous throughout the range from 0 to 1 because the confidence values for a neural network model also have that range. But the main point, again, is that those customers predicted to leave have scores ranging from .50 to 1, and those predicted to stay have scores below .50.


Figure 11.5 NN_SCORE Distribution Overlaid with Predicted Value of LOYAL

The new score fields can now be used to score a database, as is commonly done in many data-mining applications, so that customers can, for example, be selected for a marketing campaign based on their propensity to leave (the value of C5_SCORE or NN_SCORE). This is perhaps the chief advantage of creating the new field, but there is at least one other. A score field can be used in a new model to improve the prediction of LOYAL. The score fields do not perfectly predict the value of LOYAL (remember, we have been using the predicted value of LOYAL, not the actual values, in our histograms; try running the histograms with LOYAL to see the difference), but they apparently have a high degree of potential predictive power. Clearly, this is based purely upon the way that C5.0 or the neural network has differentiated between customers who will leave or stay, but if the model has a high degree of accuracy (which it does in this case), then the score field(s) may act as a very good predictor for another modeling technique. If a more complex model, such as a meta-model, were to be built, the score values from a C5.0 model could be used as an input to another modeling technique, such as a neural network. We shall look at this form of meta-modeling in the next section.

11.3 Meta-Level Modeling
The idea of meta-level modeling is to build a model based upon the predictions, or results, of another model (another type of meta-modeling simply combines the results of two or more models). In the previous section, we used a stream which contained both a trained neural network and a trained C5.0 rule induction model which, when trying to predict the field LOYAL, had accuracy figures of 79.87% and 89.17%, respectively (we can check this by executing the Analysis nodes). The confidence values were then transformed into score fields.


We can use the C5_SCORE as one of the inputs to a modified neural network model. We know that the C5.0 algorithm can predict loyalty with high accuracy; thus it is hoped that by inputting the C5.0 scores into a neural network analysis, the neural network may be able to correctly predict some of the remaining 11% of cases that the C5.0 model incorrectly classified.

Close the Histogram plot window
Click File…Close Stream and click No if asked to save changes
Click File…Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on Metamodel.str

The figure below shows the completed stream loaded into Clementine. It is fairly complex, partly because it retains most of the original stream from the previous example.

Figure 11.6 Meta-Model Stream

A Type node has been inserted after the node that creates C5_SCORE. If we are to build a model based upon results obtained from previous models, each of the newly created fields will need to be instantiated and have its direction set. We will be using both the new score field and the predicted value from the C5.0 model. A decision must be made as to which fields should be inputs to the new model. You can use all the original fields, or reduce their number, since the C5_SCORE and $C-LOYAL fields will effectively contain much of the predictive power of the original fields. Because the number of inputs isn’t large, including them along with the two new fields in the new neural network will not appreciably slow training time, and that is the approach we take here. But you may wish to drop at least some of the fields that had little influence on the model, since including all fields can lead to over-fitting.
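Outside Clementine, the same meta-modeling idea can be sketched with scikit-learn on synthetic data (everything below is an assumption used purely to illustrate the approach, not the course stream): train a tree, then feed its prediction and score into a neural network alongside the original inputs.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 6))                         # synthetic stand-ins for the churn predictors
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=1000) > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

tree = DecisionTreeClassifier(min_samples_leaf=15, random_state=1).fit(X_train, y_train)

def add_meta_features(X_part):
    pred = tree.predict(X_part).reshape(-1, 1)                # analogous to $C-LOYAL
    score = tree.predict_proba(X_part)[:, 1].reshape(-1, 1)   # analogous to C5_SCORE
    return np.hstack([X_part, pred, score])

meta = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=1)
meta.fit(add_meta_features(X_train), y_train)
print("Tree accuracy:      ", tree.score(X_test, y_test))
print("Meta-model accuracy:", meta.score(add_meta_features(X_test), y_test))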

Execute the Table node attached to the Type node downstream of the C5_SCORE Derive node (to instantiate the Type node), and then close the Table window

Edit the Type node attached to the C5_SCORE Derive node


Figure 11.7 Type Node Settings

In this example, we will use all the original input variables as predictors, plus the predicted value of LOYAL from the C5.0 model and the calculated score. The output field remains LOYAL. A Neural Net node has been attached to the Type node (and renamed MetaModel_LOYAL). We’ve set the random seed to 1000 so that everyone will obtain the same solution, and we use the Quick training method. Let’s run the model.

Close the Type node
Execute the neural network MetaModel_LOYAL
Browse the generated model, and click Expand All in the Summary tab


Figure 11.8 Output from Meta-Model for LOYAL

We can see that, not surprisingly, the field $C-LOYAL is by far the dominant input within the model. The predicted accuracy of the model has increased from 89.17% to 90.05%, an improvement of approximately 0.9 of a percentage point over the original C5.0 model. This is admittedly a small improvement, but the original C5.0 model was already very accurate. Still, every improvement can be important. The Analysis and Matrix nodes connected to the generated meta-model can be used to further analyze the new model.

It is worth mentioning that while meta-modeling is a well-accepted way of increasing the accuracy of a model, analysts can easily over-fit a model if care is not taken. The best protection against this is to ensure that a validation sample of the data is taken before any modeling takes place. The initial models, along with the more sophisticated meta-models, can then be built on part of the original data and finally tested on the holdout data. The true accuracy of the meta-model should be determined with a validation data sample. As with all holdout samples, one is looking for consistency in results between the training data and the holdout data.


11.4 Error Modeling
Error modeling is another form of meta-modeling that can be used to build a better model than the original, and it is often recommended in texts on data mining. In essence, this method is straightforward: cases with errors of prediction are isolated and modeled separately. Almost invariably, some of these cases can now be accurately predicted with a different modeling technique. However, there is a catch to this technique. In both the training and test data files we have an output field to check the accuracy of a model. Thus, in the churn data, we know whether a customer remained or left. But in real life, that is exactly what we are trying to predict. So how can we create a model that uses the fact that an error of prediction has occurred when, at the time we apply the model, we won’t know whether the model is in error until it is too late, i.e., until the event we are trying to predict has occurred? The answer to this dilemma is that, of course, we can’t, so we have to find a viable substitute strategy. The most common approach is to find groups of cases with similar characteristics for which we make a greater proportion of errors. We then create separate models for these cases, assuming that the same pattern will hold in the future. It becomes crucial to validate the models with a holdout sample when using this technique. In this section we shall build an error model on the churn data in order to investigate where the initial neural network is under-performing, and then improve it by modeling the cases more prone to prediction errors with a C5.0 model.

Close the Neural Net model browser
Close the current stream (click File…Close Stream and click No when asked to save changes) and clear the Models Manager
Click File…Open Stream
Double-click on Errors.str
Switch to small icons (right-click the Stream canvas, click Icon Size…Small)

Figure 11.9 displays the error-model stream in the Clementine Stream canvas. The upper stream in the canvas includes the generated model from the neural network and attaches a Derive node to it. The Derive node compares the original target field (LOYAL) with the network prediction of the output ($N-LOYAL), calculating a flag field (CORRECT) with a value of “True” if the prediction of the neural network is correct, and “False” if it was not. The first goal of the error model is to use a rule induction technique, which can isolate where the neural network model is under-performing. This will be done by using the C5.0 algorithm to predict the field CORRECT. We chose a C5.0 model because its transparent output will provide the best understanding of where the neural network is under-performing. In order to ensure that the C5.0 model returns a relatively simple model, the expert options have been set so that the minimum records per branch is 15. Setting this value is a judgment call based on the number of records in the training data and the number of rules with which you wish to work (another approach would be to winnow attributes, an expert option). A Type node is used to set the new field CORRECT to direction OUT, and the original inputs to the neural network to IN. It would need to be fully instantiated before training the C5.0 model.
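The same error-modeling step can be sketched with scikit-learn on synthetic data (a minimal sketch, assuming made-up predictors and outcomes; it only illustrates the logic of deriving CORRECT and fitting a shallow tree to it, not the course stream itself):

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 6))                      # synthetic stand-ins for the churn predictors
y = (X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.5, size=1000) > 0.5).astype(int)

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=2).fit(X, y)

# Derive the CORRECT flag: True where the network's prediction matches the target.
correct = net.predict(X) == y

# Shallow tree (minimum 15 records per leaf) predicting CORRECT from the original inputs only.
error_model = DecisionTreeClassifier(min_samples_leaf=15, max_depth=3, random_state=2).fit(X, correct)

# Split the records into those expected to be predicted correctly vs. incorrectly.
split = error_model.predict(X)
print("Expected correct:", int(split.sum()), " expected incorrect:", int((~split).sum()))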


Figure 11.9 Error Modeling from a Neural Network Model

In this example, the model has already been trained and added to the stream, labeled C5 Error Model. Let’s browse this model.

Edit the C5.0 generated model node labeled C5 Error Model

We generated a ruleset from the C5.0 model because it makes it easier to view the rules for the False values of CORRECT. Again, we are trying to predict the values of CORRECT, which means we are trying to predict whether the neural network was accurate or not. There are four rules for a False value.

Click the Show all levels button

Click the Show or Hide Instances and Confidence button (so instances and confidence values are visible)

These rules all have reasonable values of confidence, ranging from .667 to .818 (although you might prefer them to be a bit higher). Rule 1 tells us that for male customers who make less than about a minute of international calls per month and almost no long distance calls, and who are single (STATUS=S), we predict the value of CORRECT to be False, i.e., the wrong prediction. Some customers with these characteristics were correctly predicted to leave or stay, but the majority were not (81.8% were a false prediction).


Figure 11.10 Decision Tree Ruleset Flagging Where Errors Occur Within the Neural Network

The next step is to split the training data into two groups based on the ruleset, one for predictions of True and the other for False. We can do this by generating a Rule Tracing Supernode from the Rule browser window and applying a Reclassify or Derive node to truncate the values of the new field to just True and False. We will use the Reclassify node to modify the Rule field so that it only has two categories, which we will rename as Correct and Incorrect. Let’s check the distribution of this field.

Close the C5.0 Model browser window
Execute the Distribution node named Split


Figure 11.11 Distribution of Split Field

The neural network accuracy was 78.70%. The distribution of Split doesn’t match this exactly because we limited the records per branch to no fewer than 15, and because the C5.0 model can’t perfectly predict when the neural network was accurate or not. There are clearly enough cases with a value of Correct (992) to predict with a new model, but there are only 116 cases with a value of Incorrect, which is a bit low for accurate modeling. The best solution is to create a larger initial sample so that the 10% or so of cases predicted to be incorrect by the C5.0 model would be represented by a larger number of cases. If that isn’t possible, you can use a Balance node and boost the number of cases in the Incorrect category (although this is not an ideal solution). Since this is an example of the general method, we won’t do either; instead we’ll see how much we can improve the model with no special sampling.

Looking back at the stream, we next added a Type node to set the direction of FALSE_TRUE and Split to NONE so that they are not used in the modeling process. We wish to use only the original predictors. The stream then branches into two after the Type node. The upper branch uses a Select node to select only those records with predictions expected to be correct, while the lower branch selects those records with predictions expected to be incorrect. We reemphasize that the split of the training data is not based on the output field. Instead, only demographic and customer status fields were used to create the field Split used for record selection. It is for this reason that this model can, if successful, be used in a production environment to make predictions on new data where the outcome is unknown.

After the data are split, the customers for whom we generally made correct predictions are modeled again with a neural network. We do so because these cases were modeled well before with a neural network, so the same should be true now; and, with the problematic cases removed, we expect the network to perform better. For the customer group for which predictions were generally wrong, we use a C5.0 model to try a new technique, since the neural network tended to mispredict for this group. We could certainly try another neural network, however, or any other modeling technique. After the models are created, they are added to the stream, and Analysis nodes are then attached to assess how well each performed. Let’s see how well we did.


Close the Distribution plot window
Execute both Analysis nodes in the lower stream

The neural network model for the group of customers whose predictions were generally correct originally is correct 83.87% of the time, a substantial improvement over the base figure of 78.7%. The C5.0 model is even more accurate, correctly predicting who will leave or stay for 88.79% of the cases that were originally difficult to predict accurately. Clearly, using the errors in the original neural network to create new models has led to a substantial improvement with little additional effort. If you take this approach, you would, as usual, explore each model to see which fields are the better predictors and how this differs in each model.

Figure 11.12 Model Accuracy for Two Groups

So far so good, but we’d still like to automate the solution so that the data all flow in one stream rather than in two, and we can therefore make a combined prediction for LOYAL on new data. This is easy to do. To demonstrate, we open a stream with a modified version of the current one.

Close the current stream (click File…Close Stream and click No when asked to save changes)

Click File…Open Stream and move to the c:\Train\ClemPredModel directory
Double-click on Combined_predictions.str
Switch to small icons (right-click the Stream canvas, click Icon Size…Small)

We have combined the two generated models in sequence in this modified stream. You might think that we could simply combine the output from each model, since each was trained on a different group of cases and thus would make predictions only for those cases, but this isn't so. Although each model was trained on only a portion of the data, a generated model node scores every record that passes through it, so each model makes predictions for all the cases. (To verify this, execute the Table node.)

Figure 11.13 Combined Predictions Stream

But the solution is simple. We know that the value of the field Split tells us which model's output to use, and we make that selection in the Derive node named Prediction.

Edit the Derive node named Prediction

This node creates a new field called Prediction. When Split is equal to Correct, the value of Prediction is set to the output of the neural network. Otherwise, the value of Prediction is set to the output of the C5.0 model. Thus, we have a new field that contains the combined prediction from the best model for each group of customers.
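As a rough sketch, the CLEM formula behind such a conditional derivation would look something like the line below. The generated prediction field names are assumptions here: Clementine typically names them along the lines of '$N-LOYAL' for the neural network and '$C-LOYAL' for the C5.0 model, but you should check the actual names in your own stream with a Table or Type node.

   if Split = "Correct" then '$N-LOYAL' else '$C-LOYAL' endif

The same logic can equally be entered through the If, Then, and Else boxes of a Conditional Derive node rather than as a single formula.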

Figure 11.14 Derive Node to Create Prediction Field

We know that the baseline neural network had an accuracy of 78.7%, and made 236 errors. We will do much better with these two models. To see how much, we can execute the Matrix node that crosstabulates Prediction and LOYAL.

Close the Derive node
Execute the Matrix node named Prediction x LOYAL

The combined models made only 173 errors, quite an improvement. This translates to an accuracy of 84.38%, an increase of about 5.7 percentage points over the original neural network model.
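The arithmetic behind these figures is easy to verify, assuming the training sample contains the 992 + 116 = 1,108 records shown earlier in the Split distribution:

   Original accuracy:  (1,108 − 236) / 1,108 = 872 / 1,108 ≈ 78.7%
   Combined accuracy:  (1,108 − 173) / 1,108 = 935 / 1,108 ≈ 84.4%
   Improvement:        84.4% − 78.7% ≈ 5.7 percentage points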

Figure 11.15 Comparison of Prediction and LOYAL

The process of modeling errors need not stop here. Although there will clearly be diminishing returns as the number of errors decreases, it is certainly possible to model the remaining errors from the combined models separately. At the very least, you would still want to investigate those customers whose behavior remains difficult to model. Eventually you would validate the models with the ChurnValidate.txt dataset. We won't do that here because the stream branch with the C5.0 model predicting errors in the original neural network has only 33 records, too few for a reasonable validation. Obviously, the validation dataset should be of sufficient size, just as with the training file.

We should also note that this same technique can be used for output fields that are numeric, either integers or real numbers. In that case, the errors are relative rather than absolute, but numeric bounds can be specified to separate cases deemed to be in error from those with sufficiently accurate predictions. The former group can then be handled in much the same way as above (see the sketch below).
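As a minimal sketch of this idea, suppose the target were a hypothetical numeric field named SPEND and the neural network prediction were stored in a field named '$N-SPEND' (both names, and the bound of 100, are illustrative only, not taken from the streams used in this chapter). A Derive node could then flag records whose prediction misses by more than an acceptable amount, and that flag would play the same role that FALSE_TRUE played above:

   if abs('$N-SPEND' - SPEND) > 100 then "Incorrect" else "Correct" endif

A relative bound, such as abs('$N-SPEND' - SPEND) / SPEND > 0.10, could be used instead when the target is never zero; the choice depends on what counts as an acceptably accurate prediction for the application.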

Summary Exercises

A Note Concerning Data Files

In this training guide, files are assumed to be located in the c:\Train\ClemPredModel directory. In these exercises we will use the streams created in the chapter.

1. Use the stream Metamodel.str. Rerun the MetaModel_LOYAL neural network model, removing all the original inputs from the model and thus using only the modified confidence score and the predicted value from the C5.0 model. How does this affect model performance? Add this generated model to the stream and validate it with the ChurnValidate.txt data file. Was the model validated, in your judgment?

2. Use the stream Errors.str. Instead of using a C5.0 model to predict cases with proportionally more errors, try another neural network. How well does this perform compared to the C5.0 model? How does it compare to the accuracy of the original neural network? Do you recommend that we use a neural network for these cases?
