Owen Yang

Here is another article that will annoy genetic statisticians, although that is not its aim.

So co-localisation analysis in genetics basically assesses whether two traits are associated (or correlated) with the same gene. It is called co-localisation because the genetic locus of the two traits is the same. The term makes much more sense in marketing, where you ask whether McDonald’s and KFC tend to be at the same ‘locus.’

Causal or not causal?

It can be annoying when people want to jump to causality. First of all, there is a theory that most gene variants are randomly distributed in a well-mixed population, so a variant can work like a randomised controlled trial and be unconfounded. Therefore, when a gene variant is associated with a trait, say variants of gene RPS26 and type 1 diabetes, many people would like to conclude that this is causal, i.e. the gene variant causes the trait. Even when the causation is true, we need to be careful, because causality here may not mean what we intuitively think it means. Unfortunately this is not the place to discuss it.

Let us just accept this causality in principle. When two traits, say type 1 diabetes and autoantibodies, are associated with gene variants that are very close to each other, it could be that the same gene causes both traits. However, this is different from what some people use the approach for, which is to suggest that one trait causes the other (e.g. autoantibodies cause type 1 diabetes). It is important to understand that co-localisation analysis does not provide direct evidence for this trait-trait causality.

No perfect way to do it

Then there is the technical question of how to tell whether the two traits actually share the influence of the same gene, or whether it is just a statistical coincidence. A typical co-localisation study nowadays may use multiple sets of GWAS-level data to establish gene-trait associations, which themselves have issues with reproducibility. We then require two (or more) gene-trait associations to identify co-localisation. I genuinely cannot tell whether this is more or less credible. You may say it is so random that you cannot trust the results anymore. On the other hand, you may also say that because it is so random, finding a common gene between two traits is so unlikely that it has to be true. For me, because causality can come in so many useful and trivial forms, I really do not care whether this is true at this stage.
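To put a rough number on the "so unlikely that it has to be true" argument, here is a minimal sketch in Python. It assumes a deliberately crude model that is mine, not something from any published method: the genome is cut into a fixed number of independent, equally likely loci, and each trait has a fixed number of associated loci scattered at random. The function name and all the numbers are hypothetical.

```python
from math import comb


def prob_chance_overlap(n_loci, hits_a, hits_b):
    """Probability that two unrelated traits share at least one locus.

    Crude assumption: trait A hits `hits_a` of `n_loci` loci, and trait B
    picks `hits_b` loci uniformly at random, independently of trait A.
    """
    # Number of ways trait B can avoid every locus hit by trait A,
    # divided by the number of ways trait B can pick its loci at all.
    p_no_overlap = comb(n_loci - hits_a, hits_b) / comb(n_loci, hits_b)
    return 1 - p_no_overlap


if __name__ == "__main__":
    # Made-up numbers: 1000 loci, 20 hits for each trait.
    print(prob_chance_overlap(1000, 20, 20))
```

With these made-up numbers a chance overlap somewhere is not rare at all, which is one reason both readings above can feel defensible. Real loci are neither independent nor equally likely, so this is only an illustration.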

The most popular algorithm for co-localisation seems to be Coloc. One can always argue that the previous method is not perfect, add a little adjustment to it, and so create a different algorithm that claims to be the best. For example, one can take into account what the gene is, the location and the size of the gene, the local structure of the gene that may cause random associations, the complex downstream mechanisms of the gene that may cause random associations, the fact that there can be more than one locus in a location associated with the trait, or the fact that a non-association can be due to an under-powered analysis. It will never be perfect, but each method will tend to claim it is the best.
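For readers who want to see what such an algorithm does with the numbers, below is a simplified Python sketch of the kind of posterior calculation that Coloc-style methods perform. It is my own re-implementation for illustration, not the package's code. It assumes at most one causal variant per trait in the region, takes per-variant log Bayes factors as given, and uses prior values (p1, p2, p12) that are commonly quoted for this method but should be treated as assumptions here. The input numbers are made up.

```python
import math


def logsum(xs):
    """log(sum(exp(x))) computed without overflow."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def logdiff(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference is not positive."""
    if b >= a:
        return -math.inf
    return a + math.log1p(-math.exp(b - a))


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probability of five hypotheses for one region.

    labf1, labf2: per-variant log Bayes factors for trait 1 and trait 2,
    in the same variant order. The priors are assumed values.
    H0: no association; H1: trait 1 only; H2: trait 2 only;
    H3: both traits, different variants; H4: both traits, same variant.
    """
    both = [a + b for a, b in zip(labf1, labf2)]
    s1, s2, s12 = logsum(labf1), logsum(labf2), logsum(both)
    log_h = {
        "H0": 0.0,
        "H1": math.log(p1) + s1,
        "H2": math.log(p2) + s2,
        # All pairs of variants, minus the pairs where the variant is shared.
        "H3": math.log(p1) + math.log(p2) + logdiff(s1 + s2, s12),
        "H4": math.log(p12) + s12,
    }
    denom = logsum(list(log_h.values()))
    return {h: math.exp(v - denom) for h, v in log_h.items()}


if __name__ == "__main__":
    # Made-up evidence: both traits point at the third variant.
    print(coloc_posteriors([0, 0, 12, 0], [0, 0, 12, 0]))
    # Made-up evidence: the traits point at different variants.
    print(coloc_posteriors([12, 0, 0, 0], [0, 0, 12, 0]))
```

In the first made-up case most of the weight lands on the shared-variant hypothesis, and in the second it shifts to the different-variant one. The answer depends on the priors and on the single-variant assumption, which is exactly the kind of choice that each new method adjusts.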

The usual cliche of overselling the findings

A sad thing is that no one actually has knowledge of all proteins and genes, and therefore when a shared gene is ‘declared’ by a systemic analyst, it will always be difficult for anyone to explain the extent to which it is true or relevant. Journals and funders have not yet learned to judge work on the value of its data and methods rather than on its outcome.

In my opinion, let us keep it descriptive and let other specialists interpret the data. There is no need for a data scientist to declare a link between two biological mechanisms.