
SAS Enterprise Miner Lesson

Modeling Credit Risks:

A Practice Lesson for SAS Enterprise Miner
Fall 2000

Note:

Learning data mining can be challenging. Enterprise Miner is a complex software package that requires significant hands-on experience to master. This exercise is designed simply to be a 'test drive' so you can get an idea of how one popular software package, SAS Enterprise Miner, can be used to create models that solve real-world business problems. The exercise may require several hours to complete. It is recommended that you read this document before arriving at the lab so you have an idea of the steps you'll take. While the software is very powerful, at times it may not seem user-friendly. Start early, and feel free to ask your instructor or fellow classmates for assistance if you're stuck. Be patient, and feel free to re-read this text or run the exercise over again. Performing the exercise several times may clarify concepts beyond simply repeating the commands needed to get Enterprise Miner to run. Good luck!

This lesson has been adapted from the Enterprise Miner: Applying Data Mining Techniques course offered by SAS Institute.

INSTRUCTIONS & TIPS FOR RUNNING THE LESSON

The computer lab on the second floor of Fulton has been set up to allow students to work with Enterprise Miner. The following notes are provided to assist the student in accessing Enterprise Miner and sample data sets.

Logging Into the Lab Computers

Enter your Boston College username in the User Name box

Enter your Boston College personal identification number (PIN) in the Password box

In the Domain box, select A-H, I-P, or Q-Z from the drop-down list. You should choose the domain that corresponds to the first letter of your username. For example, if your username were GALLAUGH, you would choose A-H.

Launching the Application

You can start the Enterprise Miner application by clicking:

Start, Programs, Enterprise Miner, Enterprise Miner

When launched from this shortcut, the application will initialize the sample data set required to run the following lesson.

Protecting Your Work

The project you create will be automatically saved to the hard disk. Additionally, you will be instructed in the following directions to manually save certain parts of your project as you run the practice lesson.

If you need to leave before completing the lesson, please be sure to save your work to a Zip disk. Only one project can be saved to a computer's hard disk at a time, so if your project is there and another student needs the computer for the SAS exercise, your project will be deleted from the hard disk.

Saving Work to Zip Disks

Note: while it is possible to save work to a floppy disk, the work files will almost certainly exceed the capacity of a single disk. If you want to save your work (you do not have to), it is advised that you use a Zip disk. Individual Zip disks can be purchased from the BC Tech Products Center in the Service Building (the building with the tall chimney across from Campion & behind Cushing) for $12 each. They are also available in most office products stores such as Staples and OfficeMax.

You can save your work to a Zip disk by exporting your project. You should perform these steps only when you're through working at that computer. If you export files before you're done working, then any changes made since your export will not be saved.

Close all windows until the 'Enterprise Miner: Available Projects' window appears.

Highlight the project name from the project window.

Choose export from the file menu

A box will appear that allows you to specify a Project Transport File. Click on the right arrow and go to the Zip drive.

Name the file and click Save.

Click Next.

Click Finish. The file you created on the Zip disk contains your project work.

You can open work you have saved to a Zip disk by importing your project.

With the projects window open, go to import in the file menu.

Click on the right arrow and go to the Zip drive.

Highlight the file that contains your project (export file from above) and click Open.

Click Next.

A box will appear and you should confirm that the path for EMPROJ and EMDATA are specified correctly. Click Next.

Click Finish. You will receive a message telling you that your project has been successfully imported.

Further Information

Students interested in learning more about the SAS Enterprise Miner software might want to purchase the book Getting Started with Enterprise Miner(TM) Software. It's available for about $10 from most large Internet bookstores.

PROBLEM FORMULATION

In this example, you will use actual data to mirror a real-world application of data mining. The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, it will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived, statistically sound, and clearly interpretable credit rating model for the current process of loan underwriting. The model will be built from predictive modeling tools, but due to federal regulations, the created model must be sufficiently interpretable to provide a reason why a given applicant was rejected for a loan.

SAS Enterprise Miner will allow you to look at historical data on loan performance. Using Enterprise Miner, you will apply various techniques to recognize patterns in the data and build reliable models. With Enterprise Miner you will build and compare three models using different statistical techniques (decision trees, regression, and neural networks). The exercise will take you as far as building the three models and assessing their performance. Not shown is the next step: taking the model and incorporating it into a system that assists in the decision of whether or not to grant a loan to an applicant. SAS Enterprise Miner does have the capability to apply the created model to an existing data set (this is called scoring in SAS terminology), but we will not score additional data in this example.

The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target or dependent variable (BAD) is a binary variable indicating whether an applicant eventually defaulted or was seriously delinquent on a loan. This adverse outcome occurred in 1,189 cases (20%). The data consists of 12 input variables for each applicant with information gathered from both the application and from an external credit bureau. The independent variables used in the data are defined as follows:

Source         Variable  Description
Application    REASON    Debt consolidation or home improvement
               JOB       Six occupational categories
               LOAN      Amount of loan request
               MORTDUE   Amount due on existing mortgage
               VALUE     Value of current property
               DEBTINC   Debt-to-income ratio
               YOJ       Years at present job
Credit Bureau  DEROG     Number of major derogatory reports (bankruptcies, foreclosures, charge-offs, etc.)
               CLNO      Number of trade lines
               DELINQ    Number of delinquent trade lines
               CLAGE     Age of oldest trade line in months
               NINQ      Number of recent credit inquiries
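As a quick sanity check on the data description, the baseline rate of bad loans can be computed directly from the counts quoted above (5,960 loans, 1,189 adverse outcomes); this is only arithmetic on figures already stated in the text:

```python
# Baseline rate of the target variable BAD, using the counts quoted above.
total_loans = 5960   # home equity loans in the HMEQ data set
bad_loans = 1189     # defaulted or seriously delinquent

base_rate = bad_loans / total_loans
print(f"Baseline probability of a bad loan: {base_rate:.1%}")  # 19.9%, i.e. about 20%
```

This baseline matters later: any model's first decile should beat this roughly 20% rate by a wide margin to be useful.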

In a real-world environment, the model you are creating would be used for credit scoring. The credit scoring model will give a probability of a given loan applicant defaulting on loan repayment. Your job in this exercise is to experiment with several statistical techniques (neural networks, multiple regression, and decision trees) to find a model that creates the most reliable credit scoring method so that the firm can minimize its risk of loan defaults.
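To make the idea of a credit scoring model concrete, here is a minimal sketch of a logistic (regression-style) score in Python. The coefficients and the choice of inputs are invented for illustration only; they are not fitted to the HMEQ data and are not what Enterprise Miner will produce:

```python
import math

def default_probability(debtinc, delinq, derog,
                        intercept=-3.0, b_debtinc=0.05,
                        b_delinq=0.7, b_derog=0.6):
    """Toy logistic credit score. All coefficients are made up
    for illustration, not estimated from the HMEQ data."""
    z = intercept + b_debtinc * debtinc + b_delinq * delinq + b_derog * derog
    return 1.0 / (1.0 + math.exp(-z))  # probability of default

# A lower-risk vs. a higher-risk hypothetical applicant
p_low = default_probability(debtinc=25, delinq=0, derog=0)    # ~0.15
p_high = default_probability(debtinc=45, delinq=2, derog=1)   # ~0.78
print(p_low, p_high)
```

The point of the sketch is the output shape: the model maps an applicant's characteristics to a probability of default, which the bank can then compare against a cutoff.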

OBJECTIVES

This practice lesson gives users a chance to build three different statistical models (a standard regression, a decision tree, and a neural network) to evaluate the creditworthiness of a loan applicant. Output from SAS Enterprise Miner will allow you to make comparisons among the three models so that you can choose the most appropriate technique.

MODEL BUILDING SETUP

When the Enterprise Miner application is launched, the Available Projects window will open. Projects are built to analyze previously collected data. We first need to create a project and specify where the data being interpreted in the model should be stored.

Enterprise Miner creates several files (SAS data sets, SAS data views, etc.) as it builds a project flow diagram. These files are placed into libraries. Libraries do not contain raw data but are, rather, a location to store and organize raw data while it is being analyzed. As part of creating a project, you will specify the location and names of the libraries you want to use.

The next step will be to build a new workspace from this window in order to graphically illustrate the data analysis process (flow). Each node added to the workspace represents a step in the model being created. Steps in a statistical model could include assigning an input data source for reading data, partitioning the data to select the subset used to build the model, and assessing the results of one model against another.

Create a New Project

1. Select Project from the Insert menu (in the top toolbar). A New Projects dialog box will appear.

2. Give your project a name by typing in the Name box (your Boston College username would be a good choice).

3. In the Project section, type EMPROJ in the library box. The path (C:\Workshop\dmem\EMPROJ) should be filled in automatically.

4. In the Data section, type EMDATA in the library box. The path (C:\Workshop\dmem\EMDATA) should be filled in automatically.

5. Click OK.

Note: You may get an error message at this point that says, "Enter a new path. The path C:\Workshop\dmem\EMPROJ already contains a project." If this error message appears, follow these steps:

a) Click OK to clear the error message. You will be returned to the Save dialog box.

b) Click Cancel.

c) Delete all the projects in the new project window by right mouse clicking on the existing project and selecting Delete.

d) Choose Yes when asked to confirm the deletion of the existing project.

e) Return to step 1 above and recreate your new project. Your project will now be able to use the C:\Workshop\dmem\EMPROJ directory, so you will not get an error message at step 5.

Build the Workspace

In order to build models and assess data, you will build a process flow diagram in the Enterprise Miner workspace that graphically illustrates the steps that you'll take. Each step in the process has a node (indicated by an icon in Figure 1) that represents a particular function used in the data mining process. The purposes of the nodes listed in Figure 1 are briefly described below.

Data Partition - partitions the data so that one subset is used to build (or train) the model, while another subset is used to verify the validity of the built model.

Data Replacement - some techniques (like regression) don't work with missing values. The Data Replacement node will replace missing values with a specified or calculated default value. You will include the Data Replacement node below because (as you'll see) there are some missing values in the data being analyzed.

Regression - a statistical technique that models a linear relationship in data. The final model can be expressed as a linear equation.

Decision Tree - a statistical technique that attempts to establish decision-making rules from data. The final model can be expressed as a series of if/then/else rules.

Neural Network (not yet shown) - a statistical technique that is good at identifying difficult-to-recognize, non-linear relationships in data. A problem with neural networks is that the results produce a sort of "black box" model that is difficult to interpret (as opposed to the discernible linear equation produced by regression or the if/then/else rules of a decision tree).

Assessment - allows for a cross-comparison of the predictive reliability of all of the models that point into this node.

Add Nodes: Now that you know the purpose of the nodes, build the project workspace by adding and connecting nodes

6. In the Available Projects window, open the project that you just created (click the + sign to the right of the project's name) and double click on (projectname) diagram. A new Workspace window will appear.

7. Right mouse click anywhere in the white space in the Workspace window and then click Add node. Click on Input Data Source in the Add Node list that appears. An icon labeled Input Data Source should appear in the Workspace area.

8. Repeat the previous step, adding five additional icons to the workspace area:

Data Partition
Data Replacement
Regression
Decision Tree
Assessment

Arrange & Connect Nodes

9. Click and drag icons around the Enterprise Miner workspace to arrange them in roughly the order that is shown in the diagram below.

10. Hold the mouse over the Input Data Source icon until a + appears. Click and drag a line from the Input Data Source icon to the Data Partition icon. (After you've drawn a line, you can click anywhere in the Workspace box and an arrow will appear at the end of the line.)

Note: when connecting nodes it is important that the node is not highlighted. If the node is highlighted (so that a dotted outline surrounds the node), it will be impossible to create an arrow. To remove highlighting, just click somewhere within the white space in the workspace window, then try again to draw a line between nodes. If you mess up, you can delete a connection simply by clicking on a line and pressing the delete key.

11. Using the procedure outlined above, continue to draw arrows between the icons so that the diagram looks like Figure 1 below.

Figure 1: The completed diagram which indicates the steps to create and assess regression & decision tree models for our loan example.

Note regarding the Data Replacement node: Data replacement allows you to specify methods to impute (or replace) missing values in the data set. This is not done because we want to do it but because we have to do it. Running a regression or a neural network model on a set of observations will result in an analysis of only the "complete" observations -- that is, observations with no missing values. If you were to run a regression analysis or neural network model without doing imputation, you would be building your analysis on only a subset of the data: any observation with a "missing" value on a variable would be ignored for analysis purposes. This is a shortcoming of regression and neural net models that is not present in decision trees. You will never see an imputation node BEFORE a decision tree -- trees handle missing values directly. While SAS allows for several options regarding how to replace missing data, for this exercise we'll accept the defaults offered by this node.
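The imputation step described above can be sketched in a few lines of Python. Mean-for-numeric and most-frequent-for-categorical are assumptions about reasonable defaults (not a guarantee of the node's exact behavior), and the tiny sample rows are invented; only the column names come from the HMEQ table:

```python
from statistics import mean, mode

def impute(rows, numeric_cols, categorical_cols):
    """Fill missing values (None): mean for numeric columns,
    most frequent value for categorical ones."""
    observed = {c: [r[c] for r in rows if r[c] is not None]
                for c in numeric_cols + categorical_cols}
    defaults = {c: mean(observed[c]) for c in numeric_cols}
    defaults.update({c: mode(observed[c]) for c in categorical_cols})
    return [{c: (defaults[c] if r[c] is None else r[c]) for c in r}
            for r in rows]

rows = [{"DEBTINC": 30.0, "JOB": "Office"},
        {"DEBTINC": None, "JOB": "Mgr"},     # missing DEBTINC
        {"DEBTINC": 50.0, "JOB": None},      # missing JOB
        {"DEBTINC": 40.0, "JOB": "Office"}]
clean = impute(rows, ["DEBTINC"], ["JOB"])
print(clean[1]["DEBTINC"], clean[2]["JOB"])  # 40.0 Office
```

After imputation every row is "complete", so regression and neural network nodes downstream can use all observations rather than silently dropping the incomplete ones.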

PREPARING DATA FOR MODELING

We'll begin the exercise by building a regression model. The Equal Credit Opportunity Act mandates interpretability for credit scoring models. As mentioned earlier, both standard regression models and decision tree models are considered easy to interpret and are therefore accepted methods for determining the creditworthiness of a loan applicant.

Input Data For the Model

Setup the Input Data Source Node

1. Double click on the Input Data Source icon in the workspace to open the node.

2. Click the Select button next to Source Data.

3. Choose the CRSSAMP library from the pull down menu that appears when you click on the downward pointing triangle. Highlight the HMEQ data set from the dialog box that appears, then click the OK button. You have just set up the Input Data Source node to read input from a file named HMEQ that contains the raw historical loan data.

Identify the Target (Dependent) Variable for the Model and Check/Correct the Format of Input Data

4. Click the variables tab. A list of data variables will appear.

5. Since you are looking to see which variables/factors are associated with bad loans, you will need to specify that BAD is the model's dependent (or target) variable. Change the role of BAD from input to target by right clicking on input (in the Model Role column of the BAD row), left clicking Set Model Role, and then selecting target.

6. Confirm that the Measurement scale for DEROG is set to interval. (For proper analysis in Enterprise Miner, DEROG needs to be set to interval. If it is not, you should right mouse click in the cell intersecting the Measurement column with the DEROG row, choose set measurement, and select interval.)

Why did we do that last step? It was due to a quirk in how the data in the HMEQ file was stored. Depending on how the data set on your PC was configured, the data may have assumed that the DEROG variable was ordinal (indicating first, second, third, etc.). For purposes of this analysis, the variable should be set to interval, indicating the number of major derogatory reports that a particular applicant has received. If you set DEROG to interval, then you just corrected an error in the way the data was formatted.

Explore Your Data by Examining Distribution Plots

Before using Enterprise Miner to build your model, you may be curious about the distribution of some of the variables that you are analyzing. You can examine the distribution plots for individual variables as desired. For example, examine the distribution plot for the dependent variable BAD:

1. Click to highlight the line that contains the BAD variable (you may have to scroll to find it).

2. Right mouse click BAD

3. Choose View distribution of BAD.

Note that about 20% of the loans in the data set were bad/defaulted loans (BAD=1). You can get a rough idea of the percentage of bad loans by looking at the diagram, or you can get an exact percentage by clicking the View Info icon in the toolbar (it looks like an arrow with a question mark), then clicking on the bar that you'd like an exact count for (the exact percentage of bad loans is 19.9%). Click OK to close the chart.

Also examine the distribution plot for the independent variable DEBTINC.

4. Click to highlight the line that contains the DEBTINC variable (you may have to scroll to find it).

5. Right mouse click DEBTINC

6. Choose View distribution of DEBTINC. The graph produced should be similar to Figure 2. Note that most of the debt-to-income ratios in the data set are less than about 45. When done viewing the diagram, you can click OK to close the chart.

Figure 2

Examine the Data More Closely

7. Select the Interval Variables tab back on the Input Data Source dialog box.

Note the high missing rate for DEBTINC (you may have to scroll to find this variable). More than 20% of the applicants have a missing value for this variable. Some method will be needed to handle the missing values in the data set (that's what we'll use the Data Replacement node for).

8. Close the Input Data Source dialog box (click the close box or X in the upper right-hand corner of the dialog box). You will be asked if you want to save changes to the settings. Click Yes.

Partition the HMEQ Data for Modeling.

Most times when creating a model, we don't want to use the entire data set in the model's creation. Instead, we want to hold back a portion of the data set so that we can validate the results of the model and see how reliable a predictor it will be. This way we can check whether the results of the model are repeatable or merely a statistical fluke. SAS refers to the portion of the data used to build the model as the training sample, and the portion used to evaluate the model as the validation sample.
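Conceptually, the Data Partition node does something like the following sketch: a simple random split into training and validation samples, mirroring the 67/33 percentages you'll enter below (Enterprise Miner's actual sampling options are richer than this):

```python
import random

def partition(rows, train_pct=67, seed=0):
    """Randomly split rows into a training sample and a
    validation sample, e.g. 67% / 33%."""
    rng = random.Random(seed)          # fixed seed: repeatable split
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_pct / 100)
    return shuffled[:n_train], shuffled[n_train:]

train, valid = partition(list(range(5960)))   # one entry per HMEQ loan
print(len(train), len(valid))  # 3993 1967
```

The validation sample is never shown to the model during training, which is what lets the Assessment node give an honest estimate of how the model performs on new applicants.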

1. Double click on the Data Partition icon in the workspace to open the node.

2. Select the Partition tab.

3. Set Train, Validation, and Test to 67, 33, and 0 respectively (The number pad on the keyboard may not work. Try the numbers on the top row.). Note that if you do not enter percentages that add up to 100, SAS will not let you exit this dialog box.

4. Close the Data Partition dialog box. You will be asked if you want to save changes to the settings. Click Yes.

CREATING AND COMPARING REGRESSION AND DECISION TREE MODELS

For this analysis, we will use the default settings for the Regression and Decision Tree nodes.

1. From the Enterprise Miner Workspace, right mouse click on Assessment.

2. Click Run. The Miner will work for about 3 minutes running the models.

3. Click Yes when asked if you want to view results. A dialog box with the title Assessment Tool will appear.

4. Select both models by holding the control key and selecting decision tree and regression under tools in the models tab.

5. In the Tools menu (top tool bar), select Lift Chart. A cumulative gains chart will appear similar to Figure 3. You can maximize this window (click on the maximize icon in the upper right hand corner of the window) to view the chart in a full screen and see the complete legend for the appropriate Tool Name (or model).

Figure 3: A Cumulative %Response chart is shown by default. This chart sorts people by their probability of response (as predicted by the model) and then groups them into deciles (top 10%, next 10%, etc.). The chart then plots the actual percentage of respondents in each decile using the value of BAD.
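The decile computation behind a cumulative %Response chart can be sketched directly. The ten loans below are made-up numbers, purely to show the mechanics: rank by predicted score, then take the actual bad-loan rate within the top 10%, 20%, and so on:

```python
def cumulative_pct_response(scores, actual, deciles=10):
    """Sort cases by predicted score (descending), then report the
    cumulative percentage of actual bad loans in the top 10%, 20%, ..."""
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda p: p[0], reverse=True)]
    n = len(ranked)
    result = []
    for d in range(1, deciles + 1):
        k = round(n * d / deciles)            # size of the top d deciles
        result.append(100.0 * sum(ranked[:k]) / k)
    return result

# Tiny invented example: 10 loans, model scores and true outcomes (1 = bad)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(cumulative_pct_response(scores, actual))
```

A good model concentrates the bad loans in the early deciles, so the curve starts high and falls toward the overall bad-loan rate by the last decile.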

Analysis of Standard Regression and Decision Tree Models

The chart shows the decision tree model performing better than the regression model. The first decile of the decision tree (the top line in the graph) contains more than 80% bad loans. By comparison, the regression's first decile (the middle line) contains only about 65% bad loans. This suggests that the decision tree model you just created is a more accurate predictor of bad loans than the regression model. Also, the baseline at the bottom of the chart suggests that without a model, only about 20% of the loans in the first decile would be bad. (Remember that the histogram of the BAD variable created earlier showed that about 20% of all responses were BAD=1, so the baseline simply reflects a random sampling of BAD=1 results without any statistical model applied.)

1. Select the %Captured Response radio button.

The Target Concentration curve (%Captured Response) shows similarly dramatic results. By rejecting the worst 30% of all applications using the decision tree (see top line), you eliminate more than 80% of the bad loans from your portfolio. The same performance from the regression model (middle line) requires rejecting almost half of the applicants.
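The %Captured Response statistic differs from %Response in its denominator: it divides by all bad loans in the data, not by the size of the selected group. A sketch with the same made-up ten loans:

```python
def pct_captured(scores, actual, top_fraction):
    """Of all bad loans in the data, what share falls in the top
    `top_fraction` of cases as ranked by the model?"""
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda p: p[0], reverse=True)]
    k = round(len(ranked) * top_fraction)
    return 100.0 * sum(ranked[:k]) / sum(ranked)

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(pct_captured(scores, actual, 0.3))  # top 30% holds 2 of the 3 bad loans, ~66.7
```

This is exactly the trade-off described above: reject the worst-scoring fraction of applicants and see what share of all eventual bad loans you have eliminated.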

Clearly the default decision tree model is a better choice. Several questions remain:

Why does the tree outperform the standard regression model?

Can you grow a better tree than the one grown using the default settings?

What threshold should be used as a cutoff for rejecting loans?

NEURAL NETWORK MODELING

One reason for the superiority of the tree may be its innate ability to handle nonlinear associations between the inputs and the target. To test this idea, try modeling the data with another flexible, nonlinear regression method: a neural network.

Neural networks are more difficult to interpret than standard regression or decision tree models and are, therefore, not accepted for use under the Equal Credit Opportunity Act. As some of you may know, the Equal Credit Opportunity Act requires loan agents to create an empirically derived, statistically sound, and clearly interpretable credit model. Neural networks do not fall into this classification because the statistics derived by a neural network are difficult to interpret (in fact, at times you'll hear neural network models referred to as 'black boxes', meaning it's difficult to understand and interpret their inner workings). However, neural networks are an important statistical technique that can be used in many circumstances where interpretation is not important. They are frequently used for purposes as diverse as detecting credit card fraud and customizing web sites according to consumer purchase behavior.
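For the curious, the kind of model the Neural Network node fits can be sketched as a single-hidden-layer network with a logistic output. The weights below are arbitrary placeholders, not trained values, and the three standardized inputs are invented; the sketch only shows why the result is hard to interpret: the prediction emerges from layered sums and squashing functions, not a readable equation or rule:

```python
import math

def forward(x, W1, b1, W2, b2):
    """One forward pass: inputs -> tanh hidden layer -> logistic output."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(W1, b1)]
    z = sum(w * h for w, h in zip(W2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))   # probability that BAD = 1

x = [0.5, -1.2, 0.3]                       # standardized inputs (made up)
W1 = [[0.4, -0.2, 0.1], [-0.3, 0.5, 0.2]]  # weights for 2 hidden units
b1 = [0.0, 0.1]
W2 = [0.7, -0.6]                           # hidden-to-output weights
b2 = -0.2
p = forward(x, W1, b1, W2, b2)
print(0.0 < p < 1.0)  # True: the output is always a valid probability
```

Training (what the node spends its ten-plus minutes on) is the iterative search for weight values that make these predicted probabilities match the observed outcomes.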

Although you may not be able to use the neural network model for the purposes of credit scoring, it may give some insight into the differences between the standard regression and neural network models. It also sounds good when you can tell a prospective employer that you've got introductory experience using neural networks in data mining.

Build the Neural Network Model

1. Close the lift chart window, then close the assessment tool window (by clicking on the X in the upper right hand corner) and return to the Workspace area. If you maximized the chart, you may want to restore the window to its previous size by clicking on the restore window icon on the second line in the upper right hand corner.

2. Add a Neural Network node to the workspace. (Right mouse click in the Workspace and left click on Add Node.., then click Neural Network from the Node types list. A Neural Network icon will appear in the workspace.)

3. Connect the Data Replacement node to the Neural Network node (like regression models, neural networks don't work well with missing values, so we use the Data Replacement node to estimate and fill in the missing values).

4. Connect the Neural Network node to the Assessment node. The completed diagram should resemble Figure 4 below.

Figure 4

Run the Neural Network Model

1. Double click on the Neural Network icon to open the node.

2. Go to the Initialization tab.

3. Replace the value to the right of the Generate New Seed button with the number 611. Note that the seed just initializes the random number generator used to set a starting point for training the neural network. We're cheating a bit by setting a fixed starting point (it's usually randomly generated) so that your neural network will train faster (although it's still a long process). Most of the time you'll just accept the random value.

4. Close the Neural Network window. (If asked if you want to save changes, click Yes. You don't need to name the model.)

5. Run the diagram from the Assessment node in the Workspace (right-click on assessment and select run). A graph similar to the one shown in Figure 5 will begin crawling across the screen as the neural network is trained (built) and validated. Prepare for a wait - the model may take more than ten minutes to run (around 80 iterations).

Figure 5

6. When done, click the Yes button to view the results. (Note: It may take up to a minute between the time the graph stops drawing itself and the dialog box appears.)

7. Go to the models tab.

8. Highlight all three rows that represent the models by holding the control key and clicking on each of the three model names.

9. In the Tools menu (top tool bar), select Lift Chart.

10. Resize or maximize the window so that you can see the legend below. A diagram similar to Figure 6 will appear.

Figure 6

Assessing the Performance of the Three Models

11. Print a copy of the %Response chart by selecting Print from the File menu.

12. Click on the %Captured Response radio button.

13. Print a copy of the %Captured response chart by selecting Print from the File menu.

Although the neural network model performs slightly better than the standard regression model, the decision tree model is still the best. Nonlinear association does not explain the difference between the decision tree and the standard regression model. (When you are done examining the graph, close the Lift Chart and Assessment Tools windows.)

Note: you can stop the assignment here. Hand in the printed copies of the %Response and %Captured Response charts detailing the performance of the three models (two pages). Please be sure to write your name and section number on the top of both pages and staple them together. Those wishing to learn more about exploring the decision tree model are welcome to follow along in the following pages, but you do not have to. Data mining is currently one of the hottest topics in information systems and marketing. Students seeking an even deeper introduction to the SAS Enterprise Miner tool (and to beef up their resumes in the process) may wish to check out the book Getting Started with Enterprise Miner(TM) Software. It's available for about $10 from most large Internet bookstores.

DECISION TREE MODELING: FURTHER EXAMINATION

Decision Tree Summary & Diagram

Let's take a closer look at the decision tree model that was generated by Enterprise Miner. Recall that decision tree models consist of a set of rules used to split the data into a hierarchy. In this example, the rules developed by the decision tree make up a model that can assist us in classifying whether an applicant is likely to default on his or her loan. To explore the decision tree created by Enterprise Miner:

1. Open the Decision Tree node results window by right mouse clicking on the Decision Tree icon in the workspace and selecting Results.

2. Take a look at the All tab. (It should resemble Figure 7).

When creating a decision tree model, Enterprise Miner generates rules used to divide the data set into distinct segments or groupings (e.g. observations with a variable greater than X in one group, observations with a value less than or equal to X in another group). Each split may have its own rule applying to a given variable. The final segments that are not further divided are referred to as leaves. The table in the lower left-hand corner of the results window has highlighted the row with 7 leaves, indicating that this is the optimal number of leaves among the decision trees examined. Breaking the data into seven groups has been determined by the model to be optimal, giving results with over 88% accuracy. As indicated in the validation column of the summary information, breaking the data into more or fewer distinct groups would actually decrease the accuracy of the model.

Figure 7: From the summary information, you can see that it takes only seven leaves to beat the regression and neural network models (refer to the Number of Leaves chart in the lower right). The assessment table (lower left) gives a validation accuracy of 88.87%. (Note that the accuracy actually decreases after seven leaves, so the optimal tree has 7 leaves)
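The model-selection logic in the assessment table amounts to picking the subtree with the highest validation accuracy. The sketch below uses invented accuracy values as stand-ins for that table; only the 7-leaf figure of 88.87% comes from the text:

```python
# Hypothetical assessment table: validation accuracy by number of leaves.
# Accuracy rises, peaks at 7 leaves, then declines as the tree overfits.
accuracy_by_leaves = {3: 0.850, 5: 0.873, 7: 0.8887, 9: 0.884, 11: 0.880}

best = max(accuracy_by_leaves, key=accuracy_by_leaves.get)
print(best)  # 7
```

Using validation (rather than training) accuracy for this choice is the whole point of the earlier Data Partition step: it prevents selecting a larger tree that merely memorizes the training sample.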

Examining the Tree

3. Select Tree from the View menu (top tool bar). A picture resembling Figure 8 should appear (you may have to scroll the window to see the entire picture).

Figure 8: As indicated in the tree summary (see Figure 7), this decision tree should have seven leaves (depending on the resolution of your monitor, you may have to scroll to see the entire tree).

At this point, only six leaves (terminal groups with no more splits) are visible, but the previous dialog indicated that there should be seven leaves in the tree. By default the decision tree viewer only displays three levels deep. In order to make sure that we can see all seven leaves, increase the depth of the tree (the number of layers shown).

4. Select Tree Options from the Tools menu

5. Type 6 in the Tree depth down field.

6. Select OK. The complete tree which shows all seven leaves is now visible. It should resemble Figure 9.

Figure 9: The complete decision tree diagram contains seven leaves, one for each distinct group of characteristics of a loan applicant.

Color Analysis of the Decision Tree Model

By default, the colors in the decision tree (as well as the tree ring diagram) are set to indicate node purity. If the tree branch or leaf contains mostly ones or mostly zeros, the node is colored red. If the node contains a mix of ones and zeros, it is colored yellow or orange.
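As a rough Python sketch of this purity scheme (the exact cutoffs Enterprise Miner uses are not documented here, so the 0.9 and 0.7 thresholds below are assumptions):

```python
def purity_color(n_bad, n_good):
    """Map a node's class mix to a display color, mimicking the default
    purity scheme: pure nodes red, mixed nodes orange or yellow.
    The 0.9/0.7 cutoffs are illustrative assumptions, not SAS values."""
    purity = max(n_bad, n_good) / (n_bad + n_good)  # majority-class share
    if purity >= 0.9:
        return "red"     # nearly all ones or nearly all zeros
    if purity >= 0.7:
        return "orange"
    return "yellow"

print(purity_color(95, 5))   # → red (nearly pure)
print(purity_color(60, 40))  # → yellow (well mixed)
```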

You can change the coloring scheme to indicate target proportion of good vs. bad loans, instead of purity. To change the output so that nodes containing mostly good loans are green, while those containing mostly bad loans are red:

1. Close the tree diagram window to return to the Decision Tree-Results window.

2. Select Define Colors from the Tools menu. The Data Splits-Color Palette window will open.

3. Select the Proportion of a target value radio button

4. Select 0 in the Select a target value table

5. Select OK.

6. Return to the tree diagram by selecting tree from the view menu.

The decision tree color palette is updated so that red indicates a high proportion of bad loans (where the dependent or target variable BAD = 1), yellow indicates a balanced mixture of bads and goods, and green indicates a high proportion of good loans.

Using this new color palette, you can see how the tree is splitting the data. Figure 10 shows the decision tree with the new, updated color scheme.

Figure 10: Decision Tree with greener nodes indicating a higher proportion of good loans (BAD=0) and redder nodes indicating a higher proportion of bad loans (BAD=1).

Analysis of Decision Tree Model Splits: What Rules Did the Model Derive?

To see what sorts of rules are generating the splits in the decision tree model, you can browse the tree or tree ring diagrams.

For example, the first split of the decision tree diagram is based on the debt to income ratio (the DEBTINC variable). By examining the values in the boxes under each split, you can see how the decision rule works on the analyzed data. If the debt to income ratio is greater than or equal to 45.185, then roughly 64% of the loans are bad. For those loans with a debt to income ratio less than 45.185, 93% are good loans. The data are further split until the boxes arrive at terminal segments, or leaves. You can work from this tree to identify how the rules for the suggested Decision Tree model were created.
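Restated as code (a hypothetical Python paraphrase of the rule, not SAS output), the first split looks like this; as the tree ring exploration later in this section shows, missing DEBTINC values follow the high-risk branch:

```python
def first_split(debtinc):
    """Hypothetical paraphrase of the tree's first decision rule.
    `None` stands for a missing debt to income ratio, which the tree
    routes down the high-risk branch along with the large values."""
    if debtinc is None or debtinc >= 45.185:
        return "high-risk branch"  # roughly 64% of these loans are bad
    return "low-risk branch"       # roughly 93% of these loans are good

print(first_split(50.0))  # → high-risk branch
print(first_split(30.0))  # → low-risk branch
print(first_split(None))  # → high-risk branch
```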

Unfortunately, there isn't a really good, comprehensive view of the decision rules generated by SAS and the values associated with each classification. You can explore the results more thoroughly using the tree ring diagram. The steps below show you how.

To study the tree ring diagram, close the Decision Tree Diagram window and return to the Decision Tree-Results window, All tab.

1. Select the view information tool (looks like an arrow with a question mark on it).

2. Click in the desired segment of the tree ring (Decision Tree-Results window, All tab) to see the variable used to define the segment.

For example, select the mixed segment generated by the first split. (Node 3: the centermost orange ring. It may be helpful to maximize this window so that clicking on this ring is easier.) The window shows that this segment contains applicants with debt to income ratios in excess of 45.185 and those with missing values. (Approximately 25% of the total data either has a debt to income ratio in excess of 45.185 or is missing a value for DEBTINC.) Recall from the histogram that you created earlier in this exercise that DEBTINC had a missing value rate of more than 20%. This implies that most of the applicants in this segment have missing values.
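A back-of-the-envelope check of that claim, using the approximate figures quoted in the text:

```python
# Roughly 20% of all applicants are missing DEBTINC (from the earlier
# histogram), and the high/missing segment holds about 25% of the data.
# Every missing value lands in that segment, so missing-value applicants
# make up at least 0.20 / 0.25 = 80% of it.
missing_rate_overall = 0.20
segment_share = 0.25

min_fraction_missing = missing_rate_overall / segment_share
print(min_fraction_missing)  # → 0.8
```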

3. Select Probe Tree Ring Statistics from the Tools menu.

4. Select the mixed segment generated by the first split again.

The view information tool shows that more than 64% of the applicants who have a high (greater than or equal to 45.185) or missing debt to income ratio resulted in bad loans. (We know this refers to high DEBTINC or missing values because we identified this in step 2 above.) The majority of the applicants in this segment have a missing debt to income ratio. So if an applicant fails to report the information required to calculate a debt to income ratio, he or she is very likely a bad credit risk. In short, DEBTINC's missing status is highly associated with the target BAD=1.

This fact also explains why the decision tree model outperforms the other modeling methods. Both the Regression and Neural Network methods relied on the Data Replacement node to estimate missing values, and those estimated values hid the association between a missing DEBTINC and the target. Since decision trees can create classifications without Data Replacement, this model was able to treat missing values as part of its decision rule criteria, and that classification helped produce a better model!
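A toy illustration of this point in Python (the six hypothetical applicants below are made up for the sketch): a rule that treats "missing" as its own category separates the data better than the same kind of threshold rule applied after mean imputation.

```python
# Hypothetical applicants: most with missing DEBTINC are bad risks.
applicants = [
    {"debtinc": 30.0, "bad": 0},
    {"debtinc": 32.0, "bad": 0},
    {"debtinc": 35.0, "bad": 0},
    {"debtinc": None, "bad": 1},
    {"debtinc": None, "bad": 1},
    {"debtinc": None, "bad": 0},
]

def rule_missing(a):
    """Tree-style rule: 'missing' is itself a predictive category."""
    return 1 if a["debtinc"] is None else 0

# Imputation-style rule: replace missing values with the observed mean
# first (as a Data Replacement step would), then apply a threshold.
observed = [a["debtinc"] for a in applicants if a["debtinc"] is not None]
mean = sum(observed) / len(observed)

def rule_imputed(a):
    value = a["debtinc"] if a["debtinc"] is not None else mean
    return 1 if value >= 45.185 else 0

def accuracy(rule):
    return sum(rule(a) == a["bad"] for a in applicants) / len(applicants)

# The imputed values sit well below the threshold, so the imputation-style
# rule can no longer see the missing-value signal.
print(accuracy(rule_missing), accuracy(rule_imputed))  # ≈0.83 vs ≈0.67
```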

Note: different models are appropriate for different circumstances. While the Decision Tree model was superior in this situation, you are likely to encounter situations in which other model forms perform better.

Building a Tree from the Ring Diagram

The Decision Tree node's view path feature provides a convenient way to understand the decision rules associated with each tree ring segment.

1. From the Decision Tree-Results window, select the selection arrow tool (the one that looks like an arrow; NOT the arrow with the question mark)

2. Select the red segment (containing lots of bad loans) at the three o'clock position in the tree ring diagram.

3. Select Path from the View menu. A simplified tree diagram will appear that shows only those decisions used to generate the selected segment. The diagram should be similar to Figure 11.

Figure 11: The segment selected consists of applicants with a high or missing debt to income ratio and at least one delinquency (DELINQ >= 0.5). Of the 323 applicants in the training data who share these characteristics, almost 83% are bad credit risks (center column). Similar results hold for the validation data (right column).

You can also view the results (not the decision tree) of the segment you have selected.

1. Close the Tree Diagram window to return to the Decision Tree-Results window.

2. Select Tree from the View menu.

The tree diagram has collapsed to one leaf, which corresponds to the segment that was selected. To view the entire tree again, close the Tree Diagram window, select the center of the tree ring diagram in the Decision Tree-Results window, and select Tree from the View menu.

SUMMARY

Congratulations on completing the Data Mining exercise. Recall what you did:

You were faced with the problem of building an appropriate model for evaluating loan applicants

You instructed the Enterprise Miner to examine the data and build three models based on Regression, Decision Tree, and Neural Network statistical techniques.

You evaluated each of these models against each other and discovered that the Decision Tree model had the most predictive power.

You further explored the Decision Tree and determined that its predictive power lay in its ability to classify missing values: a missing debt-to-income ratio was picked up by the rules as a strong signal that the applicant was likely to default on his or her loan.

There were a number of things not covered in this exercise. While we examined the rules identified by the Decision Tree, we didn't look at the specifics of the regression model or the Neural Network. It is possible to take these models and incorporate them into programs that can be used for evaluating loan credit applications (or use Enterprise Miner's 'scoring' node to employ the model against a new data set); however, such an exercise is beyond the scope of the simple test drive that you've performed here.

Enterprise Miner is an extremely powerful tool and we haven't even scratched the surface of its abilities. For example, those of you familiar with statistical techniques such as regression may want to know more about how you can specify variable selection or evaluate the robustness of the derived model. Those interested in exploring Enterprise Miner further are encouraged to visit the SAS web site at www.sas.com. The firm provides a number of manuals and offers several additional training programs. The SAS help menu may also provide significant documentation and background information that can help your further exploration. The text, Getting Started with Enterprise Miner(TM) Software, is available at most Internet bookstores for about $10 and contains an additional example that takes the user through credit scoring and explores some of the other aspects of the tool in greater depth. Also be aware that there are many other providers of Data Mining software, each with their own strengths and weaknesses.

To Be Handed In by Exam:

The printed copy of the %Response and %Captured Response charts detailing the performance of the three models (two pages). Please be sure to write your name and section number at the top of both of these pages and staple them together.
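If you are curious what the %Captured Response statistic on those charts measures, here is a minimal Python sketch (with hypothetical scores, not Enterprise Miner output): rank applicants by model score, take the top fraction, and report what share of all bad loans falls in that slice.

```python
def captured_response(scored, depth):
    """Share of all targets (bad = 1) captured in the top `depth`
    fraction of applicants when ranked by model score, descending."""
    ranked = sorted(scored, key=lambda st: st[0], reverse=True)
    cutoff = int(len(ranked) * depth)
    captured = sum(bad for _, bad in ranked[:cutoff])
    total = sum(bad for _, bad in ranked)
    return captured / total

# Six hypothetical (score, bad) pairs: the top half captures 2 of 3 bads.
data = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0)]
print(captured_response(data, 0.5))  # ≈ 0.67
```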

For those of you familiar with SAS, the nodes in the Miner actually write program code in the SAS language. Like the Design Time Controls used in Visual InterDev, the graphical 'nodes' of the SAS environment insulate the user from the drudgery of programming and make the tool available to a broader audience.
