

 

naive_bayes.png


In fact, this is a visualization of the naive Bayes classifier, using a loess smoother to estimate the conditional probabilities P(y|x). Regardless, we can see a relatively smooth, almost linear increase in risk, both with increasing duration and with increasing credit amount. In that respect, both variables seem to be about equally good, although duration is somewhat better, partly because credit amount is right-skewed (most credits are small), so the big effects for large credits apply to relatively few cases.

But this is not the right way of doing regression: we have to model both variables at the same time. As the scatter plot shows, they are not independent:

scatter.png

This plot also seems to show that both of them have comparable predictive power. But now consider the nomogram of the logistic regression model:

logistic.png

The coefficient for the credit amount has shrunk considerably! This would hold even if we performed Bayesian logistic regression and took the posterior mean as a summary of the correlated coefficients. Why was it the credit amount that shrank and not the duration? I find the resolution of the logistic regression model somewhat arbitrary, in the spirit of "winner takes all".
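To make the shrinkage concrete, here is a minimal sketch on synthetic data, not the credit data set. Everything in it is an assumption made for illustration: the names `x1` and `x2`, the correlation of 0.7, the sample size, and the hand-rolled gradient-ascent fitter `fit_logistic`. By construction the outcome depends only on `x1`, which is a stronger assumption than anything we know about duration and credit amount; the point is only to show how a correlated predictor can look useful on its own and then lose most of its coefficient in the joint model.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(rows, labels, steps=200, lr=1.0):
    """Gradient ascent on the mean log-likelihood; returns [intercept, coef, ...]."""
    k = len(rows[0])
    n = len(rows)
    w = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            z = w[0] + sum(w[j + 1] * x[j] for j in range(k))
            err = y - sigmoid(z)
            grad[0] += err
            for j in range(k):
                grad[j + 1] += err * x[j]
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

# Synthetic, hypothetical data: x2 is correlated with x1 (rho = 0.7),
# but the outcome is generated from x1 alone.
rng = random.Random(0)
rho = 0.7
x1 = [rng.gauss(0, 1) for _ in range(2000)]
x2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x1]
y = [1 if rng.random() < sigmoid(a) else 0 for a in x1]

# One-variable-at-a-time fits versus the joint fit.
alone_x1 = fit_logistic([[a] for a in x1], y)[1]
alone_x2 = fit_logistic([[b] for b in x2], y)[1]
_, joint_x1, joint_x2 = fit_logistic([[a, b] for a, b in zip(x1, x2)], y)

print(f"x1 alone: {alone_x1:.2f}   x1 in joint model: {joint_x1:.2f}")
print(f"x2 alone: {alone_x2:.2f}   x2 in joint model: {joint_x2:.2f}")
```

When the two predictors both carry some real signal, as duration and credit amount probably do, the split between the coefficients is less clear-cut than in this sketch, and that is where the "winner takes all" behavior starts to feel arbitrary.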

A different interpretation is to use information theory:

informat.png

The meaning is as follows:


  • Duration alone explains 2.64% of the entropy of the risk.
  • Credit amount alone explains 2.12% of the entropy of the risk.
  • There is a 1.03% overlap between the information both of them provide.

Conditional mutual information indicates how much one variable tells us about another when we control for a third. In this case, duration would explain 2.64 - 1.03 = 1.61% of the risk entropy had we controlled for credit amount, and credit amount would explain 2.12 - 1.03 = 1.09% of the risk entropy had we controlled for duration.
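The quantities above can be computed from a joint table of counts once the variables are discretized. The sketch below is one way to do it, under assumptions: the counts are made up for illustration (they are not the credit data and will not reproduce the percentages above), both predictors are reduced to two hypothetical bins, and the helper names (`entropy`, `marginal`, `mutual_information`, `conditional_mutual_information`) are my own. It uses plain empirical frequencies, which is exactly the kind of naive joint model one would want to replace with something more reliable on real data.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a table of counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c > 0)

def marginal(joint, axes):
    """Sum the joint count table down to the given key positions."""
    out = Counter()
    for key, c in joint.items():
        out[tuple(key[i] for i in axes)] += c
    return out

def mutual_information(joint, a, b):
    # I(A;B) = H(A) + H(B) - H(A,B)
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))

def conditional_mutual_information(joint, a, b, c):
    # I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)
    return (entropy(marginal(joint, a + c)) + entropy(marginal(joint, b + c))
            - entropy(marginal(joint, a + b + c)) - entropy(marginal(joint, c)))

# Hypothetical counts keyed by (duration_bin, amount_bin, risk); NOT the real data.
counts = Counter({
    (0, 0, 0): 300, (0, 0, 1): 90,
    (0, 1, 0): 80,  (0, 1, 1): 30,
    (1, 0, 0): 90,  (1, 0, 1): 50,
    (1, 1, 0): 230, (1, 1, 1): 130,
})
D, A, Y = (0,), (1,), (2,)  # key positions of duration, amount, risk

h_risk = entropy(marginal(counts, Y))
dur_alone = mutual_information(counts, D, Y)
amt_alone = mutual_information(counts, A, Y)
dur_given_amt = conditional_mutual_information(counts, D, Y, A)
amt_given_dur = conditional_mutual_information(counts, A, Y, D)
overlap = dur_alone - dur_given_amt  # shared information about risk

for name, bits in [("duration alone", dur_alone),
                   ("amount alone", amt_alone),
                   ("overlap", overlap),
                   ("duration | amount", dur_given_amt),
                   ("amount | duration", amt_given_dur)]:
    print(f"{name}: {100 * bits / h_risk:.2f}% of risk entropy")
```

The overlap comes out the same whether it is computed from the duration side or the credit amount side, which is why a single number can describe it. It is not guaranteed to be positive: when two variables are informative only in combination, the overlap is negative.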

The only problem with this approach is that one needs to construct a reliable joint model of all three variables at once in order to estimate these information quantities.

More information about this methodology appears in my dissertation.

  • simon

    Thanks for reminding me about nomograms ... they are a great tool and very under used ... the two articles are very good ... nice work ... this blog is getting better and better!