Parsimony: model simplicity

Owen Yang

There is a constant battle everywhere between generalisation and individualisation.

In my own working environment, it is very common for an organisation to try to correct something with an additional precautionary measure: checklists to make sure accountants do not make mistakes, or checklists to make sure surgeons do not accidentally leave surgical materials inside the patient’s body. In general, adding something does improve the outcome, but the cost of the added item is frequently ignored. What tends to happen is that, after hundreds of checklists have been created, the workload becomes too much, and an additional administrative or managerial role is then added to manage the checklists.

It is probably the same with statistical models. Adding more variables almost always leads to better prediction, but the cost of adding variables is often ignored. There are real-life costs: for example, adding an expensive test may improve the precision of a cancer risk prediction by only a fraction. There are also statistical costs, for example losing statistical power or causing over-fitting.
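To make the statistical cost concrete, here is a minimal sketch (my own simulated data, not from any real study): as polynomial terms are added to a regression, the in-sample fit keeps improving, while the out-of-sample error eventually deteriorates.

```python
# A toy illustration of over-fitting: the true relationship is linear,
# so extra polynomial terms only chase the noise in the training data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = 2 * x + rng.normal(0, 0.5, 40)      # truth: a straight line plus noise
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)   # add more variables
    mse_in = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_out = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    # In-sample error always falls as the model grows; out-of-sample
    # error typically rises once the model is more complex than the truth.
    print(f"degree {degree}: in-sample MSE {mse_in:.3f}, "
          f"out-of-sample MSE {mse_out:.3f}")
```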

Sometimes there are statistical procedures to ‘penalise’ the use of a more complex model. I am sure there are many ways to do it, but the penalty is commonly built into a ‘fitness index’, which is used to assess the extent to which the statistical model fits the data (or the extent to which the data fit the model). The fitness index is essentially a composite number summarising the difference between the observed numbers (i.e. the data) and the expected numbers (i.e. the data the model predicts). A larger difference means a worse model, because the model does not fit the data well. To penalise over-complication of the statistical model, the fitness index is then adjusted for how complicated the model is, so that for the same observed-expected difference, the simpler model receives the better adjusted fitness. Sometimes the word ‘trimming’ is used for removing unnecessary parts of the model. This principle is referred to as parsimony: using the simplest model that explains the data reasonably well.
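Two well-known penalised fitness indices are AIC and BIC. The sketch below (entirely made-up numbers of my own, assuming Gaussian errors) shows the idea: both indices start from the observed-expected discrepancy and add a term that grows with the number of parameters, so a hypothetical 6-parameter model that fits only slightly better loses to a 2-parameter one once the penalty is applied.

```python
# A minimal sketch of penalised fitness indices (AIC and BIC), assuming a
# Gaussian likelihood; lower values indicate a better adjusted fit.
import numpy as np

def aic_bic(y_obs, y_exp, k):
    """AIC and BIC for a model with k parameters under Gaussian errors."""
    y_obs, y_exp = np.asarray(y_obs), np.asarray(y_exp)
    n = y_obs.size
    rss = np.sum((y_obs - y_exp) ** 2)       # observed-expected difference
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return 2 * k - 2 * log_lik, k * np.log(n) - 2 * log_lik

x = np.arange(50)
resid = np.tile([0.5, -0.5], 25)
y_obs = 3 + 0.5 * x + resid           # data scattered around a straight line
fit_simple = 3 + 0.5 * x              # hypothetical 2-parameter model
fit_complex = y_obs - 0.98 * resid    # hypothetical 6-parameter model,
                                      # with a marginally smaller residual
print("simple  (AIC, BIC):", aic_bic(y_obs, fit_simple, 2))
print("complex (AIC, BIC):", aic_bic(y_obs, fit_complex, 6))
# The complex model improves -2*log-likelihood by only about 2, but pays a
# penalty of 8 (AIC) or roughly 15.6 (BIC) for its four extra parameters,
# so the simpler model receives the better (lower) adjusted score.
```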

Statistically, it is probably important to remember that there is no perfect way of trimming, and no perfect way to penalise over-complication of a model.

The same principle can be applied beyond statistics. If we see a study proving that an expensive blood marker is a good marker of obesity, we really need to take a step back and ask why we need a good marker of obesity, and why we would not just measure weight itself. There are many legitimate reasons to look for a good marker of obesity, but one should at least think about what those reasons are.
