The Top 4 Robustness Checks for Predictive Models
Deciding whether your carefully crafted predictive model is finally ready for business can be a daunting task. Many predictive models fail to generate any positive impact, and it is hard to know whether yours will before you deploy it.
There are, however, warning signs of trouble ahead. These signs are easy to spot if you know what to look for, so you should check for them before deploying any predictive model.
But why should you listen to me? In my day job, I am the Chief Data Scientist at the AutoML company Gpredictive. At Gpredictive, we assist our clients in examining the auto-generated models to help them decide whether these models are useful for their business. As a result, our team has examined thousands of models over the last five-plus years. Recently we decided to take it up a notch and build a feature that automatically performs a first sanity check on the generated models. We call this the automated model curation.
While developing this feature, we revisited our shared experience with model metrics and went through hundreds of cases where we knew the business result of the model deployment. Consistent with the data science literature, we found that there are two things you really want to avoid in your predictive model, because models with these kinds of issues tend to collapse when used in the wild:¹
- “False Friends” (also known as target leakage): data that was marked as being available before the forecasted event but actually belongs to the event itself.
- Overfitting: fitting the training data rather than the underlying process.
Thus, we wanted to concentrate on metrics that could indicate either “False Friends” or overfitting. The criteria we came up with are necessary but not sufficient for a good model. These curation metrics are therefore meant as a warning light: if one of them is in a bad state, it is a strong sign that something is wrong, and the model should not be put into production without a very thorough vetting process.
Here are the robustness checks. Please bear in mind that they refer to categorical supervised models that try to predict whether an event (e.g., a purchase) will take place or not.
The number of positive cases in the training data
In order to have accurate predictions, the number of events in the training data needs to be large enough to allow sensible inferences. You do not want your model shaped by anecdotal data, which would inevitably lead to overfitting. In our use case, predicting future purchase behavior based on an individual customer journey, we found that there need to be at least 1,000 positive cases (purchases) for a model to be valid at all. In fact, we only call this metric “good” when there are at least 5,000 positive cases. We also found that there is typically no substantive increase in model quality beyond 15,000 cases.
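The thresholds above can be turned into a simple gate. The sketch below is an illustrative helper (not Gpredictive's actual API) that maps a positive-case count onto the bands described in the text:

```python
def positive_case_status(n_positive):
    """Classify the number of positive training cases using the
    thresholds from the article (hypothetical helper, names are my own)."""
    if n_positive < 1_000:
        return "invalid"   # too little signal; the model would fit anecdotes
    if n_positive < 5_000:
        return "usable"    # valid, but not yet "good"
    # beyond ~15,000 cases, extra positives rarely improve quality
    return "good"

print(positive_case_status(750))    # invalid
print(positive_case_status(6_200))  # good
```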
The AUC-Value in the training data set
We settled on the very versatile “Area under the Curve” (AUC) metric for capturing the prediction accuracy of the model. The killer feature for us is that this metric works independently of the underlying distribution of positive/negative cases, because it does not assume a fixed cut-off value. For a good explanation, look here.
We observed that models with an AUC value greater than 0.55, though not great, can deliver at least some value in practice. However, for a model to be really good, we found that the AUC should be greater than 0.70. We also observed that a very high AUC value may not be a good sign. We have a saying in our data science team: “If a model looks too good to be true, then very often it isn’t.”
So, we look critically at all models that have an AUC value greater than 0.92, and we do not use models that have an AUC value greater than 0.99, because that almost always means there is target leakage.
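To make these bands concrete, here is a minimal pure-Python sketch. It computes the AUC via the pairwise (Mann-Whitney) formulation, i.e. the probability that a randomly chosen positive case is scored above a randomly chosen negative one, and maps the result onto the thresholds from the text (the verdict labels are my own):

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs the model ranks
    correctly (ties count half). O(n_pos * n_neg); fine for a sketch."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_verdict(value):
    """Map a training AUC onto the article's bands (labels are illustrative)."""
    if value > 0.99:
        return "reject: almost certainly target leakage"
    if value > 0.92:
        return "suspicious: inspect for leakage"
    if value > 0.70:
        return "good"
    if value > 0.55:
        return "some practical value"
    return "too weak"

perfect = auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])  # 1.0
```

Note that this rank-based formulation is exactly why AUC is insensitive to class imbalance: it only asks how positives are ordered relative to negatives, not how many of each there are.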
The difference between the AUC value in the training and the validation data set
Not only the absolute AUC value is of interest. We observed that a high dispersion between the training AUC and the validation AUC (you do have a validation set, right? Please do.) often indicates an overfitting problem. Overfitting means that the algorithm that built your model did not capture the underlying process in your data but merely “learned” your training data. This is not a good thing, because in real life it means you can make no meaningful prediction on new data. In terms of weather prediction: your model may be an excellent predictor of yesterday’s weather but entirely unable to predict tomorrow’s.
We monitor this by comparing the training AUC value with the validation AUC value. If the validation AUC is significantly smaller, we become very skeptical of the model.
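As a sketch, the comparison can look like this. The article only says “significantly smaller”, so the 0.05 tolerance below is my own assumption, not a Gpredictive threshold:

```python
def overfitting_check(train_auc, valid_auc, max_gap=0.05):
    """Flag a model whose validation AUC falls noticeably below its
    training AUC. max_gap=0.05 is an illustrative assumption."""
    gap = train_auc - valid_auc
    return {"gap": round(gap, 3), "suspect_overfitting": gap > max_gap}

print(overfitting_check(0.85, 0.72))  # large gap: likely overfitting
print(overfitting_check(0.78, 0.76))  # small gap: fine on this criterion
```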
The influence of the top feature in the model
This metric is the top indicator of a false friend in your data. If the most influential predictor of your model has a very high weight in the overall prediction, this is often a sign of target leakage. Note that this need not be the case, as there might simply be a very good explanatory variable in the data set.
Our takeaway from our analysis was that if the weight of the top predictor is greater than 0.70, you should look very carefully at that predictor and examine whether it could be a “False Friend”. Typically, this is done by looking at the data-generating process of that feature and deciding whether its data could have been accidentally timestamped with a wrong (earlier) date/time.
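Given any table of feature importances, the 0.70 rule can be checked in a few lines. This is a hypothetical helper (the feature names are invented for illustration):

```python
def top_feature_share(importances):
    """Normalize raw feature importances and return the top feature,
    its share of total weight, and whether it crosses the 0.70
    'False Friend' review threshold from the article."""
    total = sum(importances.values())
    name, weight = max(importances.items(), key=lambda kv: kv[1])
    share = weight / total
    return name, share, share > 0.70

# Example: one feature dominates the model -> review it for leakage.
name, share, needs_review = top_feature_share(
    {"last_purchase_days_ago": 12.0, "opened_newsletter": 2.0, "age": 1.0}
)
```

In this example the top feature carries 0.8 of the total weight, so it would be flagged for a look at its data-generating process.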
We are very aware that these 4 metrics are merely the canary in the coal mine: they are base indicators that check whether something went terribly wrong in the process. They are not a green light for putting the model into practice.
Before doing that, you need to ask yourself: “Do the results of this model help my specific business case?” The answer to this question is often much harder to find than evaluating the statistical soundness of the model.
The original article is available on Medium.
¹ Nisbet, R., Elder, J. and Miner, G. (2009). Handbook of Statistical Analysis and Data Mining Applications. Academic Press. The authors cite these issues under their “Top 10 machine learning mistakes”.