Owen Yang

Discuss with me if you disagree.

Lately it has become common practice in machine learning to split data randomly into halves. The first half is used to develop a prediction model (the ‘training’ dataset), and the second half is used to test whether the prediction model can be replicated (the ‘validation’ dataset). A successful replication seems to suggest some validity of the prediction model. There are now so many such models, and there are resentful comments from grassroots model creators asking why their brilliant models are not being used to save the world (or make money).
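
For concreteness, here is a minimal sketch of the practice being described, using scikit-learn and a synthetic dataset (both are my illustrative choices, not taken from any particular study):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A synthetic dataset standing in for whatever data are at hand.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# The practice in question: split the data randomly into halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Develop the model on the 'training' half...
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ...and 'replicate' it on the held-out half.
print(f"training accuracy:   {model.score(X_train, y_train):.3f}")
print(f"validation accuracy: {model.score(X_test, y_test):.3f}")
```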

Random split validation is next to useless

Although it is always good to check whether your prediction model can be replicated, my concern is that most (not all) people do not seem to understand what they are doing. The idea of randomly splitting data into halves is to generate two databases that are identical in principle, except that the exact numbers differ. Since we have generated two identical databases, it is only to be expected that a prediction model developed from one half will work in the other half. The fact that we need to convince ourselves by testing shows how little confidence we have. There are a few reasons for this low confidence, such as data that are not big enough, wild model selection, and lack of knowledge of the subject matter. This could be another big topic of controversy, which I should probably spare you for now. But this is also Statistics 101: randomness only works when the sample is effectively sufficient in every way, and merely having physically large data is not enough.
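
To see why the test is nearly circular, one can check directly that the two random halves are statistically near-identical by construction. A sketch, again with a synthetic dataset of my own choosing:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

# Feature means and class balance in the two halves differ only by
# sampling noise -- in distribution, they are the 'same database'.
print("max difference in feature means:",
      np.abs(X_a.mean(axis=0) - X_b.mean(axis=0)).max())
print("class balance:", y_a.mean(), "vs", y_b.mean())
```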

Better replication is key

We really need to be able to replicate the model in a different (or ‘independent’) dataset to be sure the prediction model works and is ready to save your world.

Carry out a new wave of data collection and test the model in different populations. Test it in a different country.
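
A sketch of what that external check might look like in code. The file names, the ‘outcome’ column, and the idea of a frozen scikit-learn model are all hypothetical, purely to make the contrast with the random split concrete:

```python
import joblib
import pandas as pd

# Hypothetical: a model frozen after development on the original data.
model = joblib.load("model_developed_on_cohort_A.joblib")

# Hypothetical: a newly collected dataset from a different population
# (a new wave of recruitment, a different country, etc.).
new_wave = pd.read_csv("cohort_B_new_collection.csv")
X_new = new_wave.drop(columns=["outcome"])
y_new = new_wave["outcome"]

# External validation: no refitting, no re-splitting -- just score the
# frozen model on genuinely independent data.
print("external validation accuracy:", model.score(X_new, y_new))
```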

But this is such hard work compared to splitting data and running algorithms. We have a boom of academics, funders, and peer-reviewed journals who think it is cheaper and more productive to hire data processors than data collectors. Even worse, society generally rewards those who ‘invented’ the original prediction model, not those who validate it.

Therefore, we now have an exponentially growing number of prediction models generated from very finite data, each with no intention of real replication, but with the hope that some day someone will prove that ‘they have been right from the beginning.’

We really need to reward the data collectors and model replicators.

Do not let data scientists lead if they believe in agnosticism

‘Agnostic’ means something very different at different levels of intelligence. An agnostic Mozart and an agnostic rabbit play different music.

Initially people had this dream of pure agnosticism, but look at what happens now. They then created the fancy term ‘neural network’ and the like, so that the AI learns the basics before it moves on to complicated responses. It learns to identify lines and dots, then eyes and hair, then a face, and then a person. The rabbit has become Mozart, albeit still chanting agnostic songs.

So agnosticism is neither good nor bad in itself. What I am saying is: if you are going down the agnostic path, make sure you are not the rabbit.
