Measuring Model Fit

Introduction

Here we explore various ways to assess how well a model is learning, both during training and after it finishes.

Train vs. Valid vs. Test Metrics

Recall that the train, validation, and test splits differ in how they are involved in training a model. It's important to know that training metrics usually look the most positive, validation metrics second best, and test metrics the worst, assuming they are not all equal.

The reason they differ comes down to the training process. Recall that the training split is used to fit the model, the validation split is used to check performance and/or tune the hyperparameters, and the test split has no involvement in training at all. In theory, if training goes well, all three metrics should be about the same (and good in the context of the metric, whether that means high accuracy, a high $R^2$, etc.). In practice, this is rarely the case.
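As a minimal sketch of comparing the three splits (the dataset, model, and split ratios below are placeholder assumptions, not recommendations), you might fit a model and report the same metric on each split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for your own X and y.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Carve out the test split first, then split the remainder into train/valid.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# The same metric, reported on each split, so the three numbers are comparable.
for name, (features, labels) in {
    "train": (X_train, y_train),
    "valid": (X_valid, y_valid),
    "test": (X_test, y_test),
}.items():
    print(f"{name} accuracy: {accuracy_score(labels, model.predict(features)):.3f}")
```

If the three numbers are close, that is a good sign; a large gap between them is what the sections below diagnose.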

All The Metrics Are Poor

This is indicative of underfitting: your model isn't able to pick up on the signal as intended. Some things to check:

  • Make sure you have a healthy train/valid/test ratio and a reasonable number of samples overall.

  • If you have hyperparameters (tuning parameters), try fiddling with those first to build some intuition about what helps.

  • Make sure you are measuring with the right metrics: for instance, using a loss or metric meant for single-label classification on a multi-label problem might keep your model from ever predicting more than one label, as it should.

  • Revisit your model choice if you suspect it might not be the best one for this particular problem. For instance, if your data is incredibly noisy, a more flexible model may be better able to pull the signal out of all that noise (see the sketch after this list).
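As a hedged sketch of that last point, you could compare a deliberately rigid model against a more flexible one on the validation split; the specific models, dataset, and parameters here are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Noisy toy data standing in for your own problem.
X, y = make_classification(n_samples=2_000, n_features=30, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

# A rigid linear model vs. a more flexible ensemble; if both score poorly,
# revisit the data, the splits, and the metric before blaming the model class.
for model in (LogisticRegression(max_iter=1_000),
              RandomForestClassifier(n_estimators=200, random_state=0)):
    score = model.fit(X_train, y_train).score(X_valid, y_valid)
    print(f"{type(model).__name__}: validation accuracy = {score:.3f}")
```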

Good Training Metric, Poor Validation Metric

This is often indicative of overfitting. Recall from the Starter Guide page on the bias-variance tradeoff that overfitting occurs when our model matches the training data very well but generalizes poorly. Why it has occurred depends on many things, but these are common causes (the sketch after the list shows what the resulting train/validation gap can look like):

  • Not enough training samples

  • Not enough variety in training samples

  • Incredibly complex data with lots of noise

  • Trained for too long
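A minimal sketch of what that gap can look like, using a deliberately unconstrained model on a small, noisy toy dataset (all the specifics are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small, noisy dataset makes overfitting easy to provoke.
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.3, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained decision tree can memorize the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"train accuracy: {tree.score(X_train, y_train):.3f}")  # typically near 1.0
print(f"valid accuracy: {tree.score(X_valid, y_valid):.3f}")  # noticeably lower
```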

Good Training Metric, Good Validation Metric, Poor Test Metric

This is also indicative of overfitting. Here are some things that might cause it (see the splitting sketch after the list):

  • Not enough variety in training/validation

  • Not enough data to generalize

  • Data was not shuffled well enough for the splits
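On the shuffling point, here is a minimal sketch (with made-up data that arrives sorted by class) of splitting with shuffling and stratification so each split sees a similar mix of labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Pretend the raw data arrives sorted: all of class 0, then all of class 1.
X = np.arange(1_000).reshape(-1, 1)
y = np.array([0] * 500 + [1] * 500)

# shuffle=True mixes the rows before splitting, and stratify=y keeps the
# class balance roughly equal across the resulting splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0
)
print(y_train.mean(), y_test.mean())  # both close to 0.5
```

Without shuffling, the test split here would contain only class 1, and the test metric would be misleading.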

Metrics

There are many metrics, or ways to measure, model fit. Some of the best known include accuracy, mean squared error (MSE), mean absolute error (MAE), categorical cross entropy, and AUC (area under the curve). How you measure performance is contextual to your problem, and often more than one metric applies.
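For instance, mean squared error is just the average squared gap between predictions and targets: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.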

While you can certainly construct your own metrics, most data science toolkits already ship metrics as classes or functions. For instance, check out PyTorch's list of metrics.
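For example, scikit-learn exposes many of the metrics listed above as ready-made functions; here is a quick sketch with toy labels and predictions (the numbers are made up):

```python
from sklearn.metrics import accuracy_score, log_loss, mean_squared_error, roc_auc_score

# Toy binary labels and predicted probabilities.
y_true = [0, 1, 1, 0, 1]
y_prob = [0.2, 0.8, 0.6, 0.4, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold the probabilities at 0.5

print("accuracy:", accuracy_score(y_true, y_pred))
print("AUC:     ", roc_auc_score(y_true, y_prob))
print("log loss:", log_loss(y_true, y_prob))

# Toy regression targets and predictions.
print("MSE:", mean_squared_error([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```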

Metrics are often chosen in the context of the output. For instance, in classification, accuracy is an obvious choice: what percentage of samples are classified correctly? For a standard multi-class problem, categorical cross entropy is a common choice; if we have multi-label output (that is, a given sample might have more than one label that applies to it), a per-label binary cross entropy is usually the better fit.
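As a small sketch of that distinction in PyTorch (the logits and targets below are toy values): CrossEntropyLoss expects exactly one class index per sample, while BCEWithLogitsLoss scores every label independently, so several labels can be active for the same sample.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)  # 4 samples, 3 classes

# Multi-class, single label: one class index per sample.
single_label_targets = torch.tensor([0, 2, 1, 2])
print(nn.CrossEntropyLoss()(logits, single_label_targets))

# Multi-label: each class gets its own 0/1 target, and more than one
# label can be "on" for the same sample.
multi_label_targets = torch.tensor([[1., 0., 1.],
                                    [0., 1., 1.],
                                    [1., 0., 0.],
                                    [0., 0., 1.]])
print(nn.BCEWithLogitsLoss()(logits, multi_label_targets))
```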

The names of metrics can vary between toolkits, so sometimes you will have to dig into a toolkit's docs to see which metrics it ships and what they actually compute.