Introduction

After some Googling and reading of various blog posts and articles, I decided to carry out a few different feature selection techniques, record their scores in a single pandas DataFrame, and pick out the important features as appropriate. The feature selection techniques I used are:
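Whatever the exact mix of techniques, the bookkeeping pattern is the same: compute each score and store it as one column of a shared DataFrame indexed by feature name. Below is a minimal sketch of that pattern, assuming scikit-learn and an already-loaded X (feature DataFrame) and y (target), with mutual information and random-forest importances standing in as illustrative scorers, not necessarily the exact methods used here:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Assumed inputs: X is the training-feature DataFrame, y the target Series.
scores = pd.DataFrame(index=X.columns)

# Each scoring method becomes one column of the shared frame.
scores["mutual_info"] = mutual_info_classif(X, y)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
scores["rf_importance"] = forest.feature_importances_

print(scores.sort_values("rf_importance", ascending=False).head())
```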

Visualising the feature scores

Below are plots showing how the different methods of measuring feature importance compare with one another.
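One way to draw this comparison is a pairwise scatter matrix of the score columns. A minimal sketch, assuming the scores frame from the snippet above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Pairwise scatter plots of every scoring column against every other,
# to see how strongly the different measures agree.
pd.plotting.scatter_matrix(scores, figsize=(8, 8))
plt.show()
```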

The main takeaways for me are:

To remove the non-linearity in some of the charts above, I also plotted the feature ranks that these different measures produce.
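Ranking straightens out any monotonic relationship between two measures. A sketch of the conversion, again assuming the scores frame from earlier:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Rank features within each measure (1 = most important). Any monotonic
# agreement between two measures now shows up as a straight line.
ranks = scores.rank(ascending=False)
pd.plotting.scatter_matrix(ranks, figsize=(8, 8))
plt.show()
```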

There is nothing new shown in these graphs; they just make the patterns listed above a bit clearer.

Models with only the most important features

Next I produced several logistic regression models, each with a different number of the least important features removed. I used logistic regression because it was the quickest model to fit.
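A sketch of that experiment, assuming X and y as before plus a hypothetical ordered_features list (column names sorted from least to most important by the combined ranks):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumed input: ordered_features, the column names sorted from
# least to most important by the combined ranks above.
results = {}
for n_removed in range(0, len(ordered_features), 5):
    kept = ordered_features[n_removed:]  # drop the n_removed weakest features
    model = LogisticRegression(max_iter=1000)
    results[n_removed] = cross_val_score(model, X[kept], y, cv=5).mean()

for n_removed, acc in sorted(results.items()):
    print(f"removed {n_removed:3d} features -> CV accuracy {acc:.3f}")
```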

The patterns here are clear. My takeaways are:

Conclusion

It looks like removing the least important features did not improve our models. The one thing it did improve was the time taken to train them. Also, in a real-life situation (where we knew what the variables corresponded to), we would have gained insight into which variables are important, which presumably would help with decision-making.

Next steps

The next thing I will do is some hyper-parameter optimisation. After that, I will have used up all the tricks available to me, so I will look at other people's models and see what I can learn.