Santander Dataset, Part III, Learning from others
I end this series by describing what I learnt by reading other people's kernels on Kaggle.
Other posts in series
Introduction
I did some hyper-parameter optimisation, but there was nothing particularly noteworthy: I got some incremental improvements using GridSearch and that’s about it. After that, I had a look at what other people on Kaggle did to see what I could learn from them.
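For reference, here is a minimal sketch of the kind of grid search I mean, run on stand-in data; the RandomForestClassifier and the parameter grid are placeholders rather than my actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data; the real project used the Santander training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Illustrative parameter grid, not my exact settings.
param_grid = {"n_estimators": [100, 300], "max_depth": [4, 8, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # with hindsight, the right metric for this contest (see below)
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```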
Read the instructions!
For some reason, I thought that the metric for this contest was accuracy. I suspect it is because I had read this line from the instructions: “For each Id in the test set, you must make a binary prediction of the target variable.” However, other people were submitting probabilities, not just binary predictions. So, I tried submitting the raw probabilities from my model, and the score jumped from around 0.77 to 0.86! Something was amiss. Had I misunderstood accuracy? I re-read the instructions and found that I had somehow missed this line: “Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.” That makes more sense. But if they are using AUROC, then why do they ask for a binary prediction?!
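As a small illustration (with made-up labels and probabilities, nothing from the competition): AUROC is computed from the predicted probabilities, so thresholding them into 0/1 predictions throws away the ranking information it relies on.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6])  # model's predicted probabilities
hard = (proba >= 0.5).astype(int)                  # thresholded binary predictions

print(accuracy_score(y_true, hard))   # accuracy needs a threshold
print(roc_auc_score(y_true, proba))   # AUROC ranks rows by probability
print(roc_auc_score(y_true, hard))    # submitting 0/1 loses the ranking information
```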
EDA
This is something I should have done, having noticed other people doing it for the credit card dataset, but I forgot. I will make sure to do it next time! (I did some very minimal exploration, but there is clearly more I could do.)
Two other algorithms
There are two more algorithms to add to my toolbox: Naive Bayes and LightGBM. Naive Bayes is intuitive and I am surprised I have not encountered it already. LightGBM seems to be a faster alternative to XGBoost, and it appears to be the most popular algorithm in this challenge.
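To make the comparison concrete, here is a rough sketch of fitting both on synthetic stand-in data; the settings are illustrative only, and it assumes the lightgbm package is installed.

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data, not the Santander features.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)
print("Naive Bayes AUROC:", roc_auc_score(y_te, nb.predict_proba(X_te)[:, 1]))

gbm = lgb.LGBMClassifier(n_estimators=200).fit(X_tr, y_tr)
print("LightGBM AUROC:", roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1]))
```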
Creating a model for each feature separately
This idea seems to have been first described in this example and here. The intuition for why this works is that the features are seemingly independent, so you can combine the predictions made from considering each feature, one at a time. I can’t remember which example it is, but somebody else calculated all the pairwise correlations between features and found they were all very small.
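Here is a hedged sketch of the idea on synthetic data, using logistic regression on one feature at a time purely for illustration (the Kaggle kernels may well have used other models), and combining the per-feature predictions by summing log-odds, which is what independence would justify.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in data; in the competition each "feature" would be one of the var_* columns.
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

log_odds = np.zeros(len(y))
for j in range(X.shape[1]):
    # Each model only ever sees a single feature.
    model = LogisticRegression().fit(X[:, [j]], y)
    p = model.predict_proba(X[:, [j]])[:, 1]
    log_odds += np.log(p / (1 - p))  # summing log-odds = multiplying evidence under independence

# A constant prior correction is skipped: it does not change the ranking, so AUROC is unaffected.
# (In practice you would also use out-of-fold predictions rather than scoring in-sample.)
combined = 1 / (1 + np.exp(-log_odds))
print("Combined AUROC:", roc_auc_score(y, combined))
```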
Using frequency as a new feature
This idea was referred to as the magic feature (e.g. here): for each feature ‘F’, add a new feature ‘F_freq’ which records how often that row’s value of ‘F’ occurs in the data. It is not intuitive to me why this should improve the performance of the model.
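A minimal pandas sketch of what adding such a column looks like, on toy data (details such as exactly which rows the counts are taken over are glossed over here):

```python
import pandas as pd

# Toy stand-in data, not the real var_* columns.
df = pd.DataFrame({"var_0": [1.2, 3.4, 1.2, 5.6, 1.2],
                   "var_1": [0.1, 0.1, 0.2, 0.3, 0.2]})

for col in ["var_0", "var_1"]:
    # Map each value to the number of times it appears in that column.
    df[col + "_freq"] = df[col].map(df[col].value_counts())

print(df)
```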
Visualising a tree-based algorithm
In this example, there are excellent visuals demonstrating how LightGBM makes its predictions, and in particular how including the magic frequency feature improves the models.
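I won’t reproduce those figures here, but LightGBM does ship its own plotting helpers that give a similar quick look at the trees; this sketch uses toy data and needs matplotlib (and graphviz for the tree plot).

```python
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
gbm = lgb.LGBMClassifier(n_estimators=50).fit(X, y)

lgb.plot_importance(gbm)           # which features the trees split on most often
lgb.plot_tree(gbm, tree_index=0)   # the structure of the first tree
plt.show()
```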
A remarkable discovery
YaG320 made an incredible discovery: by looking at the frequencies with which different values occur, they concluded that there must be synthetic data, and then separated the real data from the synthetic. The reasoning was clever: by noticing that there were fewer unique values in the test data than in the training data, they suspected that the test data started out as some real data and was then augmented with synthetic data. Furthermore, the synthetic data would only use values that occurred in the real data. To sniff out the synthetic data, you have to ask of each row: are any of its feature values unique? If yes, then it cannot be synthetic (as synthetic data only uses values that already occurred in the real data); if not, then it is very likely synthetic (not certain, but very likely). YaG320 wrote code to implement this idea and found that exactly half of the test data was real and half was synthetic. Impressive detective work! By excluding the synthetic data when building their models, people were able to improve their scores.
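Here is my own rough reconstruction of the idea on toy data, not YaG320’s actual code: a row with at least one value that appears nowhere else in its column is presumed real, and a row whose every value is repeated elsewhere is flagged as likely synthetic.

```python
import numpy as np
import pandas as pd

def flag_real_rows(test_df: pd.DataFrame) -> np.ndarray:
    """Return a boolean array: True where a row has at least one column-wise unique value."""
    has_unique_value = np.zeros(len(test_df), dtype=bool)
    for col in test_df.columns:
        counts = test_df[col].map(test_df[col].value_counts())
        has_unique_value |= (counts == 1).to_numpy()
    return has_unique_value  # True -> presumed real; False -> likely synthetic

# Toy usage with made-up data.
toy = pd.DataFrame({"var_0": [1.0, 2.0, 1.0, 3.0],
                    "var_1": [5.0, 5.0, 6.0, 5.0]})
print(flag_real_rows(toy))
```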
It is unlikely this exact idea will be useful for me in any future projects I will do. However, it highlights the power and thrill of data science. By looking at a bunch of numbers (and with a healthy dose of creativity) one can make deductions that would otherwise be completely hidden.
Conclusion
There is a lot for me still to learn! Trying to analyse these Kaggle datasets and then comparing my approach to others’ seems to be an excellent way to learn, and I will be sure to continue it. However, I need to practise some data cleaning/data scraping, so I will start some projects in that vein soon.