Owen Yang

If you are a clinician or a biological scientist and feel this is out of your comfort zone, please hear me out, because this is written exactly for you.

Over-fitting is a big issue in prediction models and in machine learning

I have previously asked ChatGPT what over-fitting is (here). I will try to explain it again here.

The purpose of machine learning is to find a prediction model that predicts an outcome as well as possible (such as predicting disease, predicting treatment failure, or predicting whether consumers will buy certain products). It (the ‘machine’) uses a dataset in which the outcome is already known, and can then try hundreds, thousands, or tens of thousands of predictors (provided data on them are available, of course) to find the combination of predictors that best predicts the outcome in that dataset. Machine learning people would probably call this ‘training,’ as in training the machine to develop a prediction model.

When many predictors are used, it is very easy to predict the outcome nearly perfectly during training, but that does not mean the model will predict well when it is actually used to predict the unknown. Imagine you want to predict cancer and one of your factors is the personal ID number, so each individual has a unique value. Using the ID number you can certainly identify who has had cancer, and ‘predict’ it perfectly in that dataset. But because the ID number has absolutely no causal link to why someone might develop cancer, this prediction model is useless for predicting future cancer cases. If you have enough predictors to differentiate individuals well enough, you will be able to ‘predict’ the outcome perfectly simply by fitting your model to the data, but to an extent that has no relevance to the real use of the model. Hence it is called over-fitting.
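If you would like to see this with your own eyes, here is a minimal sketch (in Python with scikit-learn, using entirely made-up data). The predictors are pure random noise, so they cannot genuinely predict anything, yet a flexible model still ‘predicts’ the training data perfectly while doing no better than a coin flip on new data:

```python
# A minimal sketch of over-fitting, using made-up data.
# The 'predictors' are pure random noise, so they cannot genuinely predict
# the outcome -- yet a flexible model can still fit the training data perfectly.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))      # 500 noise predictors for 200 people
y = rng.integers(0, 2, size=200)     # outcome (e.g. cancer yes/no), unrelated to X

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy on the data used for training:", model.score(X_train, y_train))  # close to 1.0
print("accuracy on new, unseen data:          ", model.score(X_test, y_test))    # around 0.5 (chance)
```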

Statistical approaches to counteract over-fitting

I bet there are many ways of counteracting over-fitting, but here I give two examples. The first is to ‘penalise’ the model for using too many predictors, or for doing anything excessive to fit the data. Usually, when the computer (or the machine, or SPSS or SAS, whatever) tries to ‘fit the model,’ it generates a fitness index for every model (or model candidate) it tries, and selects the model with the ‘best fit.’ One can therefore intervene here and design a penalty to the fitness index when a model candidate has a certain feature, for example when it is too complex, so that priority is given to a simpler one. Remember parsimony? Read here.
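As a rough illustration (again in Python with scikit-learn, on made-up data), this is roughly what a LASSO-type penalty does: every predictor kept in the model costs something on the fitness index, so most coefficients are shrunk all the way to zero and the model stays simple.

```python
# A minimal sketch of the 'penalty' idea, on simulated data where only the
# first two of fifty predictors genuinely matter. The L1 (LASSO) penalty
# pushes the coefficients of unhelpful predictors to exactly zero.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = (X[:, 0] + X[:, 1] + rng.normal(size=200) > 0).astype(int)

unpenalised = LogisticRegression(penalty=None, max_iter=5000).fit(X, y)
penalised = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print("predictors kept without penalty:", int(np.sum(unpenalised.coef_ != 0)))  # all 50
print("predictors kept with penalty:   ", int(np.sum(penalised.coef_ != 0)))    # only a few
```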

A second approach, which is interesting, is pre- or post-selecting one factor to represent many factors. When the computer (or the machine) finds that a few factors are more or less alike, it can choose one of them to represent the group, so that the total number of predictors it needs to try is reduced. For example, when you have body weight, body mass index, waist circumference, hip circumference, and body fat percentage all in your basket, in most circumstances you will not gain much by using them all.
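One crude way of doing this, shown below as a sketch only (the variable names and data are invented for illustration), is to look at the correlations between predictors and keep just one from each cluster of near-duplicates:

```python
# A minimal sketch of letting the computer spot near-duplicate predictors.
# A predictor is dropped if it is very highly correlated with one already kept,
# so one adiposity measure ends up representing the whole group.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
weight = rng.normal(75, 12, size=300)
df = pd.DataFrame({
    "weight_kg": weight,
    "bmi": weight / 1.7**2 + rng.normal(0, 0.5, 300),    # nearly the same signal
    "waist_cm": 0.9 * weight + rng.normal(0, 3, 300),    # also highly correlated
    "age_years": rng.normal(55, 10, 300),                # a genuinely different factor
})

kept = []
for col in df.columns:
    if all(abs(df[col].corr(df[k])) < 0.9 for k in kept):
        kept.append(col)

print("predictors kept:", kept)   # one adiposity measure plus age
```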

These methods help to address some over-fitting issues and are sometimes called model shrinkage or regularisation techniques. It is not easy for me to explain why they have these names, so you will have to look that up for yourself.

Who should be doing the model shrinkage?

This is the real topic I am trying to get at, but it is difficult to reach this point without explaining all of the above.

There are many reasons why we prefer one model to another. Despite the naive utilitarian view that the best model is the one that works, most of the time a model has to make sense in order to work. The machine ‘knows’ this and tries to do it, but without the right information fed to it, the only thing it can do is churn through the numbers and fiddle with the fitness index.

It is really not appropriate for a human not to look for the right information to feed the machine, but who has the ability to do so? For example, when we try to predict cancer risk in the general population, I am asked why I do not include genomic or metabolomic factors, which have apparently been shown to predict some cancer risk. How do you answer this question? Is it appropriate to add the factors and then ask the computer to shrink the model so that the over-fitting problem is taken care of? In which scenario can you use these metabolomic factors to predict cancer risk in the general population? Who will be in charge of designing the penalty in the fitness index, and how? More importantly, do we need to care about this?

It is also common to see people use (conventional) machine learning to predict the risk of disease from previous medical history without any attempt to ‘teach’ the machine what the records mean, praying that the machine will make sense of them. How does the machine know why patients diagnosed with hepatitis A would not also be given a diagnosis of jaundice? How does the machine know why patients with a diagnosis of hypothyroidism are recorded as having a ‘medication review’ instead of a ‘hypothyroidism review’ if they also have hypertension and diabetes? Is your training data sufficient for the computer to work this out? If not, how can you expect the computer to choose the best factor to represent the others?