Predicting the credibility of future bank customers
Accuracy is the fraction of predictions that are correct (represented by the green and red squares in the image above). Obviously we want our model to be as accurate as possible; however, accuracy is not the only aspect to consider.
In our scenario, a false positive occurs when a customer who is a bad credit risk is labelled as a ‘good’ customer, increasing the potential for loss due to defaults. A false negative, on the other hand, occurs when a customer who is a good credit risk is classified as ‘bad’, resulting in the loss of potential revenue. A customer defaulting on their repayments is the greater concern because it imposes a quantifiable cost on the bank, so the number of false positives is an important factor when comparing the models.
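To make these definitions concrete, here is a minimal sketch that tallies the four outcomes from paired labels and derives accuracy and the false positive rate. The labels are hypothetical and are not taken from the bank data; the names `ConfusionCounts` and `count_outcomes` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # good customers predicted 'good'
    fp: int  # bad customers predicted 'good' (risk of default)
    tn: int  # bad customers predicted 'bad'
    fn: int  # good customers predicted 'bad' (lost revenue)


def count_outcomes(actual, predicted, positive="good"):
    """Tally the four confusion-matrix cells from paired labels."""
    c = ConfusionCounts(0, 0, 0, 0)
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                c.tp += 1
            else:
                c.fp += 1
        else:
            if a == positive:
                c.fn += 1
            else:
                c.tn += 1
    return c


def accuracy(c):
    # Fraction of all predictions that are correct
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def false_positive_rate(c):
    # Share of truly bad customers that were accepted as 'good'
    return c.fp / (c.fp + c.tn)


# Hypothetical labels, for illustration only
actual = ["good", "good", "good", "bad", "bad", "good", "bad", "good"]
predicted = ["good", "bad", "good", "good", "bad", "good", "bad", "good"]

counts = count_outcomes(actual, predicted)
print(counts)
print("accuracy:", accuracy(counts))
print("false positive rate:", false_positive_rate(counts))
```

Two models with the same accuracy can differ in their false positive rate, which is why both figures are worth reporting.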
Predictive Modelling
A logistic regression was used to fit a prediction model to the data set, and its predictions were then evaluated on the test set to determine whether the model could predict unseen data properly. From the confusion matrix we can see that the overall accuracy for logistic regression is 71.1%, the true negative result is 60% and the true positive result is 76.4%.
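The fit-then-evaluate procedure can be sketched as follows. This is not the analysis code used for the results above: it is a small pure-Python logistic regression trained by batch gradient descent, and the two features (account balance in thousands, loan duration in years), the data values, the learning rate and the epoch count are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    """Label a row 1 ('good') when the predicted probability reaches the threshold."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
        for xi in X
    ]


# Hypothetical training and test sets; 1 = good credit risk, 0 = bad
X_train = [[0.1, 3.0], [0.2, 2.5], [0.3, 4.0], [0.4, 3.5],
           [1.5, 1.0], [1.8, 2.0], [2.0, 1.5], [2.5, 1.0]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_test = [[0.2, 3.0], [2.2, 1.2], [0.3, 3.8], [1.9, 1.1]]
y_test = [0, 1, 0, 1]

w, b = fit_logistic(X_train, y_train)
predictions = predict(X_test, w, b)
test_accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print("test accuracy:", test_accuracy)
```

Evaluating on rows the model never saw during fitting is what makes the reported accuracy a fair estimate for future customers.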
Logistic Regression & Random Forest
Random forest was also used to produce predictions. From the first table above we can see that the overall accuracy for random forest was 72.8%, the true positive result was 76.4% and the true negative result was 64.5%. Although the two models are close in overall accuracy, random forest has the higher true negative result (64.5% against 60% for logistic regression), which makes it more useful for avoiding clients who potentially cannot repay their debts.
When applying the tree diagram to find the variables that contribute most significantly towards the prediction, we obtained the following:
From the tree diagram we can see that the most effective predictor was account balance. For clients with an account balance of 200EU or more, or with no account, there was an 86.6% chance that the debts would be repaid, while for clients with 0EU to 200EU in their accounts there was a 43.9% chance that the debts would not be repaid.
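The percentages at a tree split are simply outcome rates within each branch. The sketch below groups hypothetical customers by the account balance rule described above and computes the share who repaid in each group. The records, the use of `None` to encode "no account", and the function names are assumptions for illustration; the figures it prints are not the 86.6% and 43.9% reported from the real tree.

```python
from collections import defaultdict


def balance_group(balance):
    """Assign a customer to a branch of the account balance split.

    `None` stands for 'no account' (an assumed encoding).
    """
    if balance is None or balance >= 200:
        return ">=200 or no account"
    if 0 <= balance < 200:
        return "0 to <200"
    return "other"  # e.g. overdrawn accounts, not discussed in the text


def repayment_rate_by_group(records):
    """Return the fraction of customers who repaid, per balance group."""
    repaid = defaultdict(int)
    total = defaultdict(int)
    for balance, did_repay in records:
        group = balance_group(balance)
        total[group] += 1
        repaid[group] += 1 if did_repay else 0
    return {group: repaid[group] / total[group] for group in total}


# Hypothetical (balance, repaid) pairs, for illustration only
records = [
    (250, True), (None, True), (300, False), (None, True),
    (50, True), (120, False), (10, False),
]

rates = repayment_rate_by_group(records)
for group, rate in rates.items():
    print(f"{group}: {rate:.1%} repaid")
```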
Decision Tree & Deep Learning
Comparing these approaches, we find that the Deep Learning model has a higher overall accuracy and a higher precision for predicting ‘good’. This means that the Deep Learning model accepts slightly fewer ‘good’ customers, but those it accepts are less likely to be a bad credit risk (decreasing the chance of default); the cost is that more ‘good’ customers are rejected. The decision tree model, by contrast, misses out on fewer ‘good’ customers at the cost of accepting more bad credit risk customers. Since it is more important for the bank to have a higher precision for predicting good credit risks, we can conclude that the Deep Learning model is more appropriate for use in the bank than the decision tree model.
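The trade-off described here is the one between precision and recall for the ‘good’ class. The sketch below compares two hypothetical models, one selective and one permissive, and weighs their errors with assumed costs. None of the counts or costs come from the analysis; in particular, the 5:1 cost ratio between a default and a lost customer is an assumption the bank would need to set for itself.

```python
def precision(tp, fp):
    # Of the customers accepted as 'good', the share who really are good
    return tp / (tp + fp)


def recall(tp, fn):
    # Of the truly good customers, the share the model accepts
    return tp / (tp + fn)


def total_cost(fp, fn, cost_fp, cost_fn):
    # Cost of accepted bad customers plus cost of rejected good customers
    return fp * cost_fp + fn * cost_fn


# Hypothetical confusion counts for 100 applicants (80 good, 20 bad)
models = {
    "selective": {"tp": 63, "fp": 7, "fn": 17, "tn": 13},
    "permissive": {"tp": 72, "fp": 18, "fn": 8, "tn": 2},
}

# Assumed relative costs: a default hurts five times more than a lost customer
COST_FP, COST_FN = 5, 1

summary = {}
for name, m in models.items():
    summary[name] = {
        "precision": precision(m["tp"], m["fp"]),
        "recall": recall(m["tp"], m["fn"]),
        "cost": total_cost(m["fp"], m["fn"], COST_FP, COST_FN),
    }
    print(name, summary[name])
```

Under these assumed costs the selective model is cheaper despite rejecting more good customers. If a lost customer were costed as highly as a default, the ranking could reverse, so the conclusion depends on the cost ratio.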
Predictive Modelling Conclusion
Overall, it is most appropriate to employ the Deep Learning model. While logistic regression produces more accurate overall predictions on the data, we are more concerned with the positive predictive value, as it tells the bank whether the clients predicted to be ‘good’ are able to repay the credit loan. From the analysis we can observe that Deep Learning achieves a much higher rate of correct positive predictions, 88.24%, than the other prediction methods. Therefore the Deep Learning model is the most appropriate model to use, as it allows the bank to avoid bad credit clients that result in financial loss.
Finding market segments using clustering techniques
Visually, hierarchical clustering seems to have fewer outliers.
Dissecting and comparing the variables in each cluster can identify market segments and possible factors in predicting good and bad credibility.
From the plots above we can identify some variations between the 2 clusters. Cluster 1 is composed mostly of females (about 75%); their time spent in their current job is <1 - 4 years; 95% of them have <3 people dependent on them; the main purposes of the loan were appliances and business/equipment; and their ages were mainly between 20-30.
Cluster 2 was made up mostly of males (90%); their time spent in their current job was split 30% between greater than 7 years and 1 - 4 years; 75% of the cluster had more than 3 dependents; the main purpose of the loan for this group was car-related purchases; and their ages were mainly between 20-40.
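Cluster profiles like the two above come from tabulating each variable within each cluster. A minimal sketch of that tabulation follows, assuming each customer record already carries a cluster label. The records, field names and resulting shares are hypothetical and do not reproduce the percentages quoted above.

```python
from collections import Counter, defaultdict


def cluster_profile(records, variable):
    """Return, for each cluster, the share of customers per value of `variable`."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["cluster"]][record[variable]] += 1
    profile = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        profile[cluster] = {value: n / total for value, n in counter.items()}
    return profile


# Hypothetical clustered records, for illustration only
records = [
    {"cluster": 1, "sex": "female", "purpose": "appliances"},
    {"cluster": 1, "sex": "female", "purpose": "business"},
    {"cluster": 1, "sex": "female", "purpose": "appliances"},
    {"cluster": 1, "sex": "male", "purpose": "car"},
    {"cluster": 2, "sex": "male", "purpose": "car"},
    {"cluster": 2, "sex": "male", "purpose": "car"},
    {"cluster": 2, "sex": "male", "purpose": "appliances"},
    {"cluster": 2, "sex": "female", "purpose": "car"},
]

sex_profile = cluster_profile(records, "sex")
purpose_profile = cluster_profile(records, "purpose")
for cluster in sorted(sex_profile):
    print(cluster, sex_profile[cluster], purpose_profile[cluster])
```

Running the same function over every variable gives a side-by-side table from which segment descriptions can be written.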