Owen Yang

I am quite pleased when people actually follow the rules.

It is common for someone to ask me to confirm that they should do a ‘non-parametric’ test because their data violate the assumption of a normal distribution. There seems to be an algorithm somewhere in the Universe telling people to do an ANOVA or a t-test, and, if normality is violated, to do some sort of non-parametric test with Irish or French names. I tend to surprise them when I admit I have no idea which is which.

I also tend not to remember the names of authors, heads of department, or even the names of my most frequent acquaintances. I need to improve on that. But that is another story.

Why do we like parametric tests?

I do not know why ‘we’ like parametric tests, but I know why I like them. For a pragmatic person, such as a clinician, a politician, or an economist, knowing whether or not a treatment causes changes in an outcome is just a romantic story. We need to know the extent to which the treatment changes the outcome: for example, the extent to which prescribing a statin may reduce the risk of heart attack. A parametric test in general, almost by definition, tells you the amount by which A changes the amount of B. There is usually an actual formula that calculates it. This is called the ‘effect size’, i.e. the size of the effect of A on B.
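As a minimal sketch of this point, here is a comparison of two entirely made-up groups (the group names, sample sizes, and distributions are my own assumptions, not from any real study). The parametric summary gives an effect size in the units of the outcome, which is the number a pragmatic reader actually wants:

```python
import random
import statistics

random.seed(1)

# Hypothetical 'risk scores' in a treated and an untreated group
# (purely illustrative numbers, drawn from normal distributions).
treated = [random.gauss(4.0, 1.0) for _ in range(200)]
control = [random.gauss(5.0, 1.0) for _ in range(200)]

# The parametric comparison gives an effect size directly:
# the mean difference, in the same units as the outcome itself.
effect_size = statistics.mean(treated) - statistics.mean(control)
print(f"mean difference (effect size): {effect_size:.2f}")
```

A rank-based test on the same data would tell you only that the groups differ; it would not, by itself, tell you that treatment lowers the score by about one unit.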

If you do not care about the effect size, then you are one of the blessed ones. You tend to be the sort of scientist who thinks about who deserves the credit, not someone who actually makes the change.

Just to digress a little more: you are also among the relatively blessed ones if you only care about the relative effect size, but not about the actual absolute size of the impact. But again, this is another topic.

There is a rumour that says ‘parametric tests give you smaller p values’ or ‘parametric tests have greater statistical power.’ I find these claims not easy to prove or disprove, but more importantly I do not think they are a good reason for selecting a parametric test.

Why do we need to test for a (normal) distribution?

No. I do not think we need to ‘test’ the distribution all the time. I find that people tend to run a test in the hope of showing the distribution is normal when it may not be.

I find it most sensible to think that the reason we need to ‘see’ that the distribution is normal comes down to the definitions of the mean and the standard deviation. If something is close to a normal distribution, then you can expect a bell shape with the largest number of observations in the middle, spreading out evenly in both directions. The standard deviation tells you how spread out the curve is, and you can expect around 95% of observations to fall within 1.96 standard deviations of the mean, and 5% to fall beyond that (2.5% on each side). Because the p value is calculated based on this expectation, it can only work in this way. The p value here is the probability that a value drawn from the reference distribution would be at least as extreme as the number observed. If that chance is small, say p is 0.0001, then we believe the observed number is not from the reference distribution, but in fact from a new distribution that is ‘significantly different’ from the reference distribution.
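Both claims above are easy to check numerically. This sketch (with an arbitrary simulated sample, not data from the post) verifies the 95%-within-1.96-SD expectation and computes a two-sided p value from the normal distribution using the standard error-function identity:

```python
import math
import random

random.seed(42)

# Draw from a reference normal distribution and check the textbook claim:
# about 95% of observations fall within 1.96 standard deviations of the mean.
mu, sd = 0.0, 1.0
sample = [random.gauss(mu, sd) for _ in range(100_000)]
within = sum(abs(x - mu) <= 1.96 * sd for x in sample) / len(sample)
print(f"proportion within ±1.96 SD: {within:.3f}")  # close to 0.95

def two_sided_p(z: float) -> float:
    """Two-sided p value for an observed z under the standard normal,
    using the normal CDF written in terms of math.erf."""
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)

print(f"p for z = 1.96: {two_sided_p(1.96):.3f}")  # about 0.05
```

An observed value 1.96 standard deviations from the mean sits exactly at the conventional p = 0.05 boundary, which is where that familiar threshold comes from.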

Look at the distribution first

Therefore, do not bother doing tests. Please look at the distribution first. You can use numbers, for example, and look at medians and percentiles. You can also use histograms or box plots. When you look at these, ask yourself: if you only had the mean and the standard deviation of these numbers, would you have guessed the distribution looks like this? If not, then no test result should justify using a parametric test.
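The question above can be put into code. As an illustration, assume a skewed outcome such as hospital length of stay (my own hypothetical example, simulated here as an exponential distribution): the mean and standard deviation predict a range that the actual percentiles flatly contradict, which is exactly the mismatch to look for:

```python
import random
import statistics

random.seed(0)

# A skewed, hypothetical outcome (e.g. hospital length of stay in days),
# simulated as an exponential distribution with mean 5.
data = [random.expovariate(1 / 5.0) for _ in range(10_000)]

mean = statistics.mean(data)
sd = statistics.stdev(data)

# Cut points every 2.5%: q[0] is the 2.5th percentile,
# q[19] the median, q[38] the 97.5th percentile.
q = statistics.quantiles(data, n=40)
low, median, high = q[0], q[19], q[38]

print(f"mean ± 1.96 SD:        [{mean - 1.96 * sd:.1f}, {mean + 1.96 * sd:.1f}]")
print(f"actual 2.5th–97.5th:   [{low:.1f}, {high:.1f}], median {median:.1f}")
# For a roughly normal outcome the two intervals would agree and the
# median would sit near the mean; here the parametric interval dips
# below zero, which a length of stay never can.
```

If you only had the mean and standard deviation, you would have guessed a symmetric range; the percentiles show something quite different, so a parametric summary of these numbers would mislead.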

An obvious follow-up issue would be transforming the data (e.g. taking the logarithm) to approach a normal distribution, or using parametric models with other types of distributions. But we will just stop here today.

