Machine Learning vs Statistics: Culture clash?
Traditional statistics has been around for hundreds of years, and practitioners have been plugging away for just as long at developing algorithms that help us extract meaning from data. Much of their work went unnoticed by the general public until recently, as their theorems and models seldom offered practical solutions to problems involving the messy data we often encounter in the wild. But recent developments in computer hardware and software have brought the field of statistics into almost every aspect of our modern lives.
This new wave of activity has been the engine fueling the data science revolution currently underway. Algorithms such as gradient boosted trees and neural networks have cast a long shadow over the ANOVAs (analyses of variance) and logistic regressions that served statisticians so well for many decades. Is it time to consign these antiquated methods to the rubbish heap? Do they have any relevance in the modern era of big data, artificial intelligence and cloud computing? Statisticians Brian Ripley and Andrew Gelman had this to say:
‘Machine learning is statistics minus any checking of models and assumptions.’ - Brian D. Ripley
‘In that case, maybe we should get rid of checking of models and assumptions more often. Then maybe we'd be able to solve some of the problems that the machine learning people can solve but we can't!’ - Andrew Gelman
I would argue that these traditional statistical methods still have a very important role to play, though one somewhat different from the role of modern machine learning algorithms. While doing research as a scientist, most of my analytical work involved traditional regression techniques such as ANOVA or multiple regression. Now, working as a data scientist, I rely mostly on gradient boosted trees and neural networks to make predictions. Different techniques are used because these two realms are fundamentally interested in different outcomes. As a scientist, my principal interest is inference – understanding the relationship between predictors and outcomes. As a data scientist, I am mostly interested in prediction – for example, my clients want to know who is going to purchase their product and who isn’t.
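To make the contrast concrete, here is a minimal sketch (using scikit-learn on synthetic data; the models and data are illustrative assumptions, not taken from the text above): a logistic regression exposes coefficients that describe how each predictor relates to the outcome, which serves inference, while a gradient boosted tree model produces class probabilities without any such simple summary, which serves prediction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: two predictors and a binary outcome (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (1.5 * X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Inference-oriented model: the fitted coefficients summarise how each
# predictor shifts the log-odds of the outcome.
logit = LogisticRegression().fit(X, y)
print("logistic regression coefficients:", logit.coef_)

# Prediction-oriented model: gradient boosted trees return class
# probabilities, but offer no single coefficient per predictor to interpret.
gbt = GradientBoostingClassifier().fit(X, y)
print("boosted-tree predicted probabilities:", gbt.predict_proba(X[:1]))
```

The point is not that one model is better: the logistic regression yields a quantity you can reason about scientifically, while the boosted trees simply answer "who will and who won't".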
Sometimes we get carried away trying to produce the most accurate predictions using machine learning methods without stopping to think about how the outcomes we are modelling are affected by potential influencing factors. Often the methods we choose for making predictions don’t even allow us to explore the relationship between these factors and the predictions. On the other hand, as statisticians, we can be so concerned that the assumptions required to infer causality between inputs and outputs are met that we forget to check that our models are making predictions that are useful and can be implemented in practice.
Inference and prediction should not be regarded as mutually exclusive tasks when building models. It’s time to pause and view machine learning and statistics as two sides of the same coin.