#analyticsx

Combining Logistic Regression and Time Series Analysis on Commercial Data for modeling Credit and Default Risk

Sergiu Buciumas, Department of Statistics and Analytical Sciences, Kennesaw State University

Supervised by Jennifer Lewis Priestley, Ph.D., Department of Statistics, Kennesaw State University

ABSTRACT

Despite a surge in research on credit risk modeling over the past decade, few studies have combined time series analysis with logistic regression; most focus on survival analysis. A review of the literature suggests a gap in this area: most models rely strictly on either time series or logistic regression to predict mortgage default. After 2008, the Federal Reserve Bank began performing credit portfolio analysis based on time series, while consumer credit reporting agencies have mostly relied on regression models to build consumer scorecards, and little research has combined the two approaches for consumer scorecards. Credit models are useful for evaluating the risk of consumer loans, and a more precise prediction model yields financial returns for the institution.

This poster analyzes credit risk modeling for commercial data by combining logistic regression and time series analysis, in order to see how the predicted probability of a good score is distributed over time and to classify consumers based on their behavior. Credit risk modeling is important to banks that offer loans to commercial clients.

This paper combines quantitative and qualitative techniques to predict credit risk by forecasting probabilities. Data were collected from the credit scoring agency Equifax, one of the major credit bureaus in the United States, and contain 305 variables with quarterly observations from 2006-Q1 to 2014-Q4. Seven variables were selected as inputs based on their effects on the score and to limit multicollinearity: Total Cycle 4+ Non-Financial Past Due Amount in Last 12 Months; Ratio of Number of Non-Financial Accounts Non-Delinquent Currently to Number of Non-Financial Accounts Reported in Last 3 Months with Known Current Delinquency Status (NFA3monCurRate); Worst Non-Financial Payment Status in the Last 12 Months; Worst Non-Financial Payment Status in the Last 3 Months; Worst Telco Payment Status; Worst Industry Payment Status; and Percent of Non-Financial Past Due Amount to Total Balance Reported in Last 12 Months. The target variable was binary, indicating the risk category for each quarter (1 = good, 0 = bad). The data were prepared for modeling by imputing and transforming variables. A two-step analysis was used to forecast the default probabilities over the short term. Predicted default probabilities were first obtained from a backward elimination logistic regression model, selected on the basis of misclassification. These probabilities were then forecast as a time series using the exponential smoothing method, selected on the basis of mean absolute error. Results show the forecast for eight quarters (up to 2016-Q4).

Logistic regression

Our business problem is to evaluate the credit quality of debtors. All the predictor variables are entered into a logistic regression with the good score variable as the response. Backward selection was used: every variable is in the model at the start, and the least significant variable is removed at each step, using a cutoff at the 0.05 level of significance, until all remaining variables are significant. Some output from the logistic regression is shown in the table Analysis of Maximum Likelihood Estimates below.

The coefficients are interpreted much as in an ordinary regression model, except that they act on the logit (log odds) scale rather than on the response itself. For instance, the coefficient of the WstNFpay12mon variable (Worst Non-Financial Payment Status in the Last 12 Months) is 0.2797, so a one-unit increase in WstNFpay12mon changes the log odds by 0.2797. Exponentiating the coefficient (e^0.2797 ≈ 1.32) gives the odds ratio. For the ffwcrate variable, every one-unit increase (each additional cycle a customer is delinquent on a finance account) multiplies the odds of default by 1.31. The table Odds Ratio Estimates contains only the top 7 variables from the model, to keep the model simple. Variables are sorted by decreasing chi-square statistic, so the most significant variables in the model appear first.
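As a minimal sketch of the coefficient interpretation above, the following Python snippet converts the quoted WstNFpay12mon coefficient into an odds ratio and shows how the same change in log odds moves a probability. The baseline probability of 0.50 and the helper name shift_probability are assumptions made only for illustration; they do not come from the poster's output.

```python
import math

# Coefficient quoted in the text for WstNFpay12mon.
coefficient = 0.2797

# The odds ratio is the exponentiated coefficient.
odds_ratio = math.exp(coefficient)


def shift_probability(p, log_odds_change):
    """Apply a change in log odds to a starting probability p (0 < p < 1)."""
    log_odds = math.log(p / (1.0 - p)) + log_odds_change
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical baseline probability, chosen only for illustration.
baseline = 0.50
shifted = shift_probability(baseline, coefficient)

print("odds ratio:", round(odds_ratio, 4))
print("probability after a one-unit increase:", round(shifted, 4))
```

The odds ratio is constant across baselines, while the change in probability is not: the same 0.2797 shift in log odds moves a probability near 0.5 more than one near 0 or 1.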

Logistic regression continued

The frequency table for good_score shows the distribution of good_score in our dataset. Consumers with good_score = 1 make up 50.44% of the population, so a slight majority of the individuals in the dataset are labeled as good.

Conclusion

Logistic regression modeling results

One primary goal of our model is to generate an equation that can reliably classify observations into one of two outcomes. The degree to which the model's predictions agree with the data is shown graphically by the receiver operating characteristic (ROC) curve in Figure 4; in our case the area under the curve (AUC) is 0.6384.

The ROC curve gives a graphical description of sensitivity versus 1 minus specificity. Sensitivity is the proportion of events correctly classified as events (the true positive fraction). The AUC for our model is above the 0.5 expected from random guessing, so the model shows some ability to discriminate. Ideally, the curve climbs quickly toward the top-left corner, meaning the model correctly predicts the cases. Index plots of the Pearson residuals and the deviance residuals, and the ROC curve for model3, are listed below.
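To make the ROC construction concrete, here is a small Python sketch that builds the (1 minus specificity, sensitivity) points from predicted probabilities and integrates them with the trapezoidal rule. The labels and scores are hypothetical toy values, not the poster's data, and the function names roc_points and area_under_curve are our own.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Classify as an event every observation scoring at or above the threshold.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of ROC points ordered by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: 1 = good score (event), 0 = bad score.
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

toy_auc = area_under_curve(roc_points(toy_labels, toy_scores))
print("toy AUC:", toy_auc)
```

An AUC of 0.5 corresponds to the diagonal line of random guessing and 1.0 to perfect separation, which is the scale against which a value such as 0.6384 is read.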

Based on our research, we see that combining time series with binary logistic regression output can produce meaningful results in credit risk modeling. The risk management profession is already getting better at integrating a number of time series techniques into the credit landscape. One example is survival analysis, which is intended to yield insight into the point-in-time default predictions needed for profitability and financial projections, and which is used to forecast and analyze mortgage portfolios for financial institutions.

This poster has presented another way to integrate time series data into the credit risk modeling process, one that deals with the design of a new class of predictive characteristics. Although simple and straightforward, this design could improve the impact and future development of certain scorecards, especially if they are created or implemented in an environment dependent on economic impact. On the portfolio forecasting side, models using panel or pooled data approaches are gaining popularity because they are designed to evaluate changes across the business cycle for purposes of strategic planning and stress testing. These approaches allow comparisons to be made relative to a base case outlook and can incorporate a variety of economic scenarios that depend on significant risk factors, such as the impact of government stimulus, unemployment, economic crisis and many more.

FORECASTING MODEL USING TIME SERIES

A seasonal time series model was used to forecast the good_score probabilities obtained from the logistic regression. The ArchiveDate variable served as the time ID variable, creating a time series with quarterly intervals. Based on the output listed, the forecast model is adequate and only one outlier is detected. Such an outlier is usually inspected individually so that it does not distort the forecast. We detect seasonality in the forecast, and the 2006-2014 data fitted by the model also show clear seasonality.
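The forecasting step can be sketched in Python as follows. This is a simplified illustration, assuming simple (non-seasonal) exponential smoothing with the smoothing weight chosen by in-sample mean absolute error (MAE); it is not the poster's actual procedure or software output. The quarterly series, the grid of alpha values and all function names are hypothetical.

```python
def simple_exp_smoothing(series, alpha):
    """Return (one-step-ahead fitted values, final smoothed level)."""
    level = series[0]                 # initialize the level at the first observation
    fitted = []
    for value in series:
        fitted.append(level)          # forecast made before observing `value`
        level = alpha * value + (1.0 - alpha) * level
    return fitted, level


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def forecast_quarters(series, horizon=8, alphas=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick alpha by in-sample MAE, then return a flat forecast of length `horizon`."""
    best_alpha = min(
        alphas,
        key=lambda a: mean_absolute_error(series, simple_exp_smoothing(series, a)[0]),
    )
    _, level = simple_exp_smoothing(series, best_alpha)
    return best_alpha, [level] * horizon


# Hypothetical quarterly mean probabilities of a good score (not the poster's data).
quarterly_prob = [0.52, 0.50, 0.49, 0.51, 0.53, 0.50, 0.48, 0.52]
best_alpha, forecast = forecast_quarters(quarterly_prob)
print(best_alpha, [round(f, 4) for f in forecast])
```

Simple exponential smoothing produces a flat forecast and ignores seasonality. Capturing the quarterly seasonal pattern described above would require a seasonal variant such as Holt-Winters smoothing, which adds a seasonal component for each quarter.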