Credit Default Classifier Model

This is a tutorial, or really more of an experiment, on using the random forest algorithm to predict credit default. It explains the variables used in the classifier and shows how to implement the random forest algorithm in Python in a Google Colab notebook. You can also download the notebook as a Python script and run it on your local machine, which is generally faster than running it on the basic version of Google Colab.

What is the objective here?

In simple terms, this is a tool that learns from historical data to predict whether a new customer is likely to default on a loan. The code works in three steps:

  1. The code begins by importing the tools needed for data analysis and machine learning. It then loads a CSV dataset into a pandas DataFrame.
  2. After loading the data, any rows with missing information are removed to ensure the data is clean and complete. The data is then split into two parts: one for training the model and the other for testing its accuracy. Additionally, the data is standardized to ensure that different types of information (like age, income, etc.) are on a similar scale for the model to understand.
  3. The code uses a machine learning algorithm called Random Forest Classifier to predict whether a customer will default on a loan. This algorithm is trained using the training set, and hyperparameter tuning is performed to find the best settings for the algorithm. The model is then retrained using the best settings found during the hyperparameter tuning process.
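
The three steps above can be sketched in a few lines. This is only a minimal illustration, assuming a small synthetic table in place of the real CSV (the two columns, the toy target and the injected missing values are all made up for the example). The scaler here is fitted on the training rows and then reused on the test rows, which is one common way to keep test information out of training.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan CSV (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "income": rng.normal(50, 20, n),
    "debtinc": rng.uniform(0, 40, n),
})
# Toy target: a higher debt-to-income ratio makes default more likely
df["default"] = (df["debtinc"] + rng.normal(0, 5, n) > 20).astype(int)
# Blank out a few values so that dropna() has something to remove
df.loc[[3, 50, 120], "income"] = np.nan

# Step 2: drop incomplete rows, split, then scale
df = df.dropna()
x = df.drop(columns=["default"])
y = df["default"]
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)

sc = StandardScaler()
xtrain_scaled = sc.fit_transform(xtrain)  # learn mean and std from the training rows only
xtest_scaled = sc.transform(xtest)        # reuse those statistics on the test rows

# Step 3: train a random forest and check it on the held-out rows
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(xtrain_scaled, ytrain)
test_accuracy = model.score(xtest_scaled, ytest)
print("rows after dropna:", len(df))
print("test accuracy:", test_accuracy)
```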

The script draws two plots. The first shows how the model’s cross-validated accuracy changes across the hyperparameter settings tried in the grid search, which helps to find the combination of settings that gives the highest accuracy in predicting defaults. The second is a learning curve: it shows how the training and validation accuracy change as the model sees more examples during training. The accuracy of the final model is then calculated on a test set that it hasn’t seen before. This accuracy score is a measure of how well the model is expected to perform on new, unseen data. A classification report is also generated to give more detail on the model’s performance (precision, recall and F1 score for each class).
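
To make the tuning and evaluation steps concrete, here is a small sketch, again on synthetic data from make_classification rather than the loan CSV. The grid is deliberately tiny (an assumption for speed, not the grid used in the full script), and it relies on GridSearchCV's default refit=True, which leaves a model retrained on the whole training set in best_estimator_.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data standing in for the loan dataset (an assumption)
x, y = make_classification(n_samples=300, n_features=8, random_state=0)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)

# A deliberately tiny grid so that the search finishes quickly
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,
)
grid_search.fit(xtrain, ytrain)

# One mean cross-validated score per combination of settings
cv_results = grid_search.cv_results_
for params, score in zip(cv_results["params"], cv_results["mean_test_score"]):
    print(params, round(score, 3))
print("best settings:", grid_search.best_params_)

# best_estimator_ has already been refitted on all of xtrain with the best settings
best_model = grid_search.best_estimator_
ypred = best_model.predict(xtest)
held_out_accuracy = best_model.score(xtest, ytest)
print("held-out accuracy:", held_out_accuracy)
print(classification_report(ytest, ypred))
```

The scores printed in the loop are the same numbers that a grid search plot is built from, and the last two lines are the accuracy score and classification report described above.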

Finally, we demonstrate how to use the trained model on a new customer. We create a sample DataFrame with information about an imaginary customer and use the trained model to predict whether that customer will default on the loan, given their circumstances.
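
One detail matters when predicting for a new customer: the new row has to be scaled with the same statistics as the training data. A possible way to make that hard to get wrong is scikit-learn's make_pipeline, which bundles the scaler and the forest into one object. The sketch below trains on just the nine sample rows shown in this post, so the prediction itself means little; the new customer's values are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

columns = ["age", "ed", "employ", "address", "income",
           "debtinc", "creddebt", "othdebt", "default"]
# The nine sample rows from this post (far too few for a real model)
rows = [
    [41, 3, 17, 12, 176, 9.3, 11.35939, 5.008608, 1],
    [27, 1, 10, 6, 31, 17.3, 1.362202, 4.000798, 0],
    [40, 1, 15, 14, 55, 5.5, 0.856075, 2.168925, 0],
    [41, 1, 15, 14, 120, 2.9, 2.65872, 0.82128, 0],
    [24, 2, 2, 0, 28, 17.3, 1.787436, 3.056564, 1],
    [41, 2, 5, 5, 25, 10.2, 0.3927, 2.1573, 0],
    [39, 1, 20, 9, 67, 30.6, 3.833874, 16.66813, 0],
    [43, 1, 12, 11, 38, 3.6, 0.128592, 1.239408, 0],
    [24, 1, 3, 4, 19, 24.4, 1.358348, 3.277652, 1],
]
train = pd.DataFrame(rows, columns=columns)
x = train.drop(columns=["default"])
y = train["default"]

# The pipeline scales and then classifies; predict() reuses the scaling learned in fit()
clf = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
clf.fit(x, y)

# Hypothetical new customer; the column names and order must match the training data
new_customer = pd.DataFrame([{
    "age": 30, "ed": 2, "employ": 5, "address": 3,
    "income": 40, "debtinc": 15.0, "creddebt": 2.0, "othdebt": 4.0,
}])
prediction = clf.predict(new_customer)
print("Predicted default (1 = default, 0 = no default):", prediction[0])
```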

Explaining the CSV data

Since this is only an experiment, the loan data is not that sophisticated. It uses only quantitative data to predict default. The input variables are:
age: age of the customer
ed: education level
employ: work experience in years
address: years at current address (the column holds a number, not a street address)
income: yearly income of the customer
debtinc: debt-to-income ratio, in percent
creddebt: credit card debt (an amount, not a ratio)
othdebt: other debts
The last column, default, is the target: 1 if the customer defaulted and 0 if not.

age  ed  employ  address  income  debtinc  creddebt  othdebt   default
41   3   17      12       176     9.3      11.35939  5.008608  1
27   1   10      6        31      17.3     1.362202  4.000798  0
40   1   15      14       55      5.5      0.856075  2.168925  0
41   1   15      14       120     2.9      2.65872   0.82128   0
24   2   2       0        28      17.3     1.787436  3.056564  1
41   2   5       5        25      10.2     0.3927    2.1573    0
39   1   20      9        67      30.6     3.833874  16.66813  0
43   1   12      11       38      3.6      0.128592  1.239408  0
24   1   3       4        19      24.4     1.358348  3.277652  1
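
A quick sanity check on these sample rows helps to read the columns. The sketch below parses the rows above with pandas and tests one relationship that seems to hold in them: creddebt plus othdebt equals debtinc percent of income. That suggests debtinc is a percentage and that creddebt and othdebt are amounts in the same unit as income. This is an observation about the nine rows shown here, not a documented property of the full dataset.

```python
import io
import pandas as pd

sample = """age ed employ address income debtinc creddebt othdebt default
41 3 17 12 176 9.3 11.35939 5.008608 1
27 1 10 6 31 17.3 1.362202 4.000798 0
40 1 15 14 55 5.5 0.856075 2.168925 0
41 1 15 14 120 2.9 2.65872 0.82128 0
24 2 2 0 28 17.3 1.787436 3.056564 1
41 2 5 5 25 10.2 0.3927 2.1573 0
39 1 20 9 67 30.6 3.833874 16.66813 0
43 1 12 11 38 3.6 0.128592 1.239408 0
24 1 3 4 19 24.4 1.358348 3.277652 1
"""
# The sample is whitespace-separated, so split on runs of whitespace
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# Total debt implied by the ratio versus the two debt columns added together
implied_debt = df["income"] * df["debtinc"] / 100
reported_debt = df["creddebt"] + df["othdebt"]
max_gap = (implied_debt - reported_debt).abs().max()

print(df.dtypes)
print("missing values per column:", df.isna().sum().to_dict())
print("largest gap between implied and reported debt:", max_gap)
```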

Limitations: this is a time-independent, non-probabilistic classifier model

This is not the time-horizon-based probability of default model (PD model) that IFRS 9 Financial Instruments requires for calculating the Expected Credit Loss (ECL). It is much simpler and, again, just an experiment with the data. Even with that limitation, it is more than a simple linear regression model: it works with multivariate data, taking eight quantitative attributes of a borrower into account when classifying whether the loan will default.

In summary: 
1. This is not a time-dependent model.
2. This is not a probability model but only a classifier (it only classifies whether the loan will default or not).
3. This is not just a linear regression analysis; it works with multiple variables using a tree bagging method (i.e. the random forest model).
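
A note on the second point: the script only calls predict, so it only ever produces hard 0 or 1 labels. scikit-learn's RandomForestClassifier does also offer predict_proba, which returns the class probabilities averaged over the trees. The sketch below, on synthetic data, shows the two side by side. Those scores are not calibrated and have no time horizon, so they are still not a PD in the IFRS 9 sense, and the limitation above stands.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data standing in for the loan dataset (an assumption)
x, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(x, y)

labels = forest.predict(x[:5])        # hard class labels, as used in the script
scores = forest.predict_proba(x[:5])  # one column per class, averaged over the trees

print("classes:", forest.classes_)
print("labels:", labels)
print("scores:\n", scores)

# Each row of scores sums to 1, and predict() picks the class with the highest score
rows_sum_to_one = np.allclose(scores.sum(axis=1), 1.0)
labels_match = bool((forest.classes_[scores.argmax(axis=1)] == labels).all())
print(rows_sum_to_one, labels_match)
```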

Try it in Google Colab Notebook

Here is the link to the dataset: 

Here is the link to the Google Colab Notebook. You can duplicate this notebook to your drive and start experimenting on your own.

Or try it on your local machine with the Python code below

            # courtesy of critical spaghetti

"""PD Model with Random Forest Algorithm

Original file is located at
"""

# Loading the dependencies

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

"""# Loading the CSV into the dataframe"""

# Load the dataset; set csv_link to the dataset URL or a local file path first
csv_link = ""
df = pd.read_csv(csv_link)

# Drop rows with missing values
df = df.dropna()

# Extract features and target variable
x = df.drop(['default'], axis=1)
y = df['default']

"""# Splitting into training / testing set and Scaling for non-uniformity"""

# Split the data into training and testing sets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)

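# Standardize the feature data using StandardScaler
sc = StandardScaler()
xtrain = sc.fit_transform(xtrain)
# Reuse the training-set mean and standard deviation; refitting on the test set would leak test statistics
xtest = sc.transform(xtest)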

"""# Training the Default Classifier Model using Random Forest Algorithm"""

# Hyperparameter tuning for Random Forest Classifier with an extended search space
param_grid = {
    'n_estimators': [200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it meant the same as 'sqrt'
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

# Create a Random Forest Classifier
rfc = RandomForestClassifier()

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='accuracy', cv=5, n_jobs=-1)
grid_search.fit(xtrain, ytrain)

"""# Plotting the grid search progress"""

# Extract the results of the grid search
results = pd.DataFrame(grid_search.cv_results_)

# Extract relevant information for plotting
param_cols = ['param_' + param for param in param_grid.keys()]
mean_test_scores = results['mean_test_score']

# Handle 'max_features' separately as it is a categorical variable
max_features_values = results['param_max_features']
unique_max_features = max_features_values.unique()

# Create line plots for each hyperparameter
plt.figure(figsize=(16, 10))
for max_feature_value in unique_max_features:
    mask = (max_features_values == max_feature_value)
    plt.plot(results[mask]['param_max_depth'], mean_test_scores[mask], marker='o', label=f'max_features={max_feature_value}')

# Set labels, title and legend, then show the plot
plt.xlabel('max_depth')
plt.ylabel('Mean Test Score')
plt.title('Grid Search Progress')
plt.legend()
plt.show()

"""Retraining the default rate classifier with the best parameters"""

# Create a Random Forest Classifier with the best hyperparameters from the grid search
best_params = grid_search.best_params_
rfc_best = RandomForestClassifier(**best_params)

# Get the best parameters and retrain the model
best_params = grid_search.best_params_
best_rfc = RandomForestClassifier(**best_params)
best_rfc.fit(xtrain, ytrain)

"""# Plotting the learning curve of the classifier model"""

# Plot learning curve
plt.figure(figsize=(12, 8))
train_sizes, train_scores, test_scores = learning_curve(rfc_best, xtrain, ytrain, cv=5, scoring='accuracy', train_sizes=np.linspace(0.1, 1.0, 10))

# Calculate mean and standard deviation of training and test scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot the learning curve
plt.plot(train_sizes, train_mean, label='Training Score', marker='o')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15)
plt.plot(train_sizes, test_mean, label='Validation Score', marker='o')
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.15)

# Set labels, title and legend, then show the plot
plt.xlabel('Training Examples')
plt.ylabel('Accuracy Score')
plt.title('Learning Curve')
plt.legend()
plt.show()

"""# How accurate is the default classifier model?"""

# Evaluate the improved Random Forest model on the test set
improved_score = best_rfc.score(xtest, ytest)
print("Improved Random Forest Classifier Accuracy:", improved_score)

# Display classification report for more detailed evaluation
ypred = best_rfc.predict(xtest)
print("\nClassification Report:\n", classification_report(ytest, ypred))

"""# Try this classifying model with your own data"""

# Create a DataFrame for new data
new_data = pd.DataFrame({
    'age': [41], # Age of the customer
    'ed': [1], # Education level
    'employ': [17], # Work experience in years
    'address': [12], # Years at current address
    'income': [176], # Yearly income of the customer
    'debtinc': [9.3], # Debt-to-income ratio, in percent
    'creddebt': [11.359392], # Credit card debt
    'othdebt': [5.008608] # Other debts
})

# The model predicts 'default' for this customer (1 = default, 0 = no default)

# Preprocess the new data using the same StandardScaler object
new_data_scaled = sc.transform(new_data)

# Use the trained RFC model to make predictions on the new data
prediction = best_rfc.predict(new_data_scaled)

# Print or use the prediction as needed
print("Prediction key: 1 = default, 0 = no default")
print("Predicted default for the data frame you entered above:", prediction[0])