Owen Yang

Most of the time, when someone needs to ask the question ‘do I need to do multiple imputation?’, the answer will be no.

However, there are situations where it could be worth doing, for example when your reviewers ask for it and there is at least a soft reason to comply. So let us explore this.

Missing values in our study

This will not be your go-to article on what multiple imputation is, but in brief, multiple imputation, or MI (though try not to use the acronym, because it is somewhat annoying), is short for ‘multiple imputation for missing values’ in statistical analysis. When you conduct a study and collect data, or when you acquire data from somewhere else, there can be missing values for all sorts of reasons. A missing value is annoying because you cannot perform analysis using those data.

Say you are asking whether daily intake of vitamin D is associated with the risk of developing cancer. You ask 1000 participants about their food intake, using a 66-item food-frequency questionnaire that asks how frequently they eat each of 66 types of food (meat, veggies, fruits, etc.). Some people will simply not answer some questions: you might have 50 participants not answering how often they eat avocados, 25 not answering how often they drink coffee, and so on. If you need to calculate vitamin D intake from these questions, you are stuck.

To drop or to impute

There are two main ways to deal with missing values: drop the participants or impute the missing values. Dropping the participants simply means not using the data collected from them, and in this case you might lose a lot of the data you collected. This is mainly based on the assumption (or wishful thinking) that the data are missing randomly, without a particular reason. The jargon is missing at random, or MAR, which is another unnecessary acronym: be aware of it, but try not to use it. Because the missingness is random, dropping the data does not affect your conclusion. What it may affect is your statistical power, because there are fewer data left after dropping.
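The dropping approach (often called complete-case analysis) can be sketched in a few lines. This is a minimal illustration with made-up data, not anything from the study described above; `NaN` stands in for an unanswered questionnaire item.

```python
import numpy as np
import pandas as pd

# Hypothetical food-frequency answers (servings per week); NaN = unanswered.
df = pd.DataFrame({
    "avocado_per_week": [3.0, np.nan, 1.0, 0.0],
    "coffee_per_week":  [7.0, 7.0, np.nan, 14.0],
})

# Complete-case analysis: drop every participant with any missing answer.
complete = df.dropna()
print(len(df), "participants collected,", len(complete), "left after dropping")
```

Note how quickly the sample shrinks: half of these four toy participants are gone even though each skipped only one question, which is exactly the statistical-power cost described above.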

Imputing is making an educated guess

To impute a missing value is to fill in the blank based on an educated guess. On a food questionnaire like this, the main reason for missingness is that there are just so many questions that participants could only be bothered to fill in the items they actually ate, leaving the rest blank. If that seems true, you can simply fill in the blanks with zeros. However, there are other situations where this is not appropriate. Depending on what makes sense, some people fill in the blank with the mean value of the remaining participants, or the mean value of participants with the same characteristics (sex, age, etc.). Based on those characteristics, one can also estimate the value using statistical models such as multiple regression, although you can see this is similar (but not identical) to assigning the mean values of participants with the same characteristics.
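The three single-imputation choices above (zeros, the overall mean, the group mean) can be sketched with pandas. The column names and values here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical questionnaire data (servings per week); NaN = unanswered.
df = pd.DataFrame({
    "sex":     ["F", "F", "M", "M"],
    "avocado": [2.0, np.nan, 4.0, np.nan],
})

# Option 1: treat a blank as "never eats it".
zeros = df["avocado"].fillna(0)

# Option 2: fill with the overall mean of everyone who answered.
overall = df["avocado"].fillna(df["avocado"].mean())

# Option 3: fill with the mean of participants sharing a characteristic (sex).
by_group = df.groupby("sex")["avocado"].transform(lambda s: s.fillna(s.mean()))
```

Each option encodes a different belief about why the answer is missing, which is why the choice should be driven by what makes sense for your data rather than by convenience.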

If you do the estimation using a model, there will be uncertainty in your estimates (means and standard deviations, so to speak). So instead of taking the mean, some people think it is important to factor in this uncertainty. This is typically done by creating multiple versions of the dataset with imputed values, say 5 datasets, each one slightly different from the others because the missing values are imputed slightly differently. They differ because the imputed values are drawn at random from the distribution the model estimated. Therefore, if you imputed 1000 datasets, you would find that the imputed numbers across the 1000 datasets are distributed in the way the model estimated (i.e. with the estimated means and standard deviations).
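One way to generate several imputed datasets like this is scikit-learn's `IterativeImputer` with `sample_posterior=True`, which draws each imputed value from the model's estimated distribution instead of plugging in the mean. The synthetic data below are invented purely to show the mechanics.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] += X[:, 0]          # relate the columns so imputation has some signal
X[::5, 2] = np.nan          # knock out every 5th value in the last column

# Draw m = 5 imputed datasets; sample_posterior=True adds random draws
# around the regression prediction, so each copy differs slightly.
imputed = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(5)
]
```

The observed values are identical in all five copies; only the filled-in cells vary, and that variation is what carries the imputation uncertainty into the final analysis.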

A typical procedure with multiple imputation is to perform whatever analysis you intended to do separately on each dataset, and then use a statistical rule to combine the results. The results from each dataset will be slightly different, and they are combined into a summary statistic, much like combining results in a meta-analysis.
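The usual combining rule is Rubin's rules: average the point estimates, and build the pooled variance from the average within-dataset variance plus an inflated between-dataset variance. A minimal sketch, with made-up per-dataset results standing in for whatever your five analyses produced:

```python
import numpy as np

# Hypothetical results: one coefficient estimate and its squared standard
# error from each of m = 5 imputed datasets.
estimates = np.array([0.52, 0.48, 0.55, 0.50, 0.47])
variances = np.array([0.010, 0.011, 0.009, 0.010, 0.012])
m = len(estimates)

# Rubin's rules: pool the point estimates and their uncertainty.
pooled_estimate = estimates.mean()
within = variances.mean()            # average within-dataset variance
between = estimates.var(ddof=1)      # between-dataset variance
total_variance = within + (1 + 1 / m) * between
pooled_se = np.sqrt(total_variance)
```

The `between` term is what the meta-analysis comparison hints at: if the five datasets disagree a lot, the pooled standard error grows, so the imputation uncertainty is not swept under the rug.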

To trust or not to trust?

The logic behind these imputations (or estimations of missing values) is called missing not at random (or MNAR). The values are missing not at random, but for some reason. Hopefully that ‘reason’ can be captured by something we know about these participants. But since this is not always true, there is no reason to believe that a result based on multiply imputed data is automatically more trustworthy.

I tend to agree to multiple imputation on two occasions (that I can think of just now). The first is to do it for reassurance. If there is no difference before and after a complex multiple imputation, it works as reassurance. If there is a difference, then there could be an issue that needs to be addressed. For me it is important not to say we should trust one over the other.

The second occasion might sound crazy to medical researchers, but it is not that crazy in other parts of the universe. Have you heard of missing on purpose? Have a look at this link and it may blow your mind (here).