I used to be confused when people talked about No Free Lunch theorems in machine learning. The gist of this idea is that no one machine learning algorithm will ever be best for all modelling problems. I always thought that this would be better described as No Silver Bullet or Can’t Win Them All. This idea was originally described by Wolpert & Macready in their 1997 paper “No Free Lunch Theorems for Optimization“, which is quite a nice read:
“We have dubbed the associated results No Free Lunch theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.”
Whenever I teach machine learning the No Free Lunch idea usually comes up pretty early on when students ask why there are so many different machine learning algorithms for building predictive models? Tools like RapidMiner make this very explicit when they show the number of options they provide.
And it’s a reasonable question really. Machine learning research has a history of well over 50 years now and you might think that we would have figured out one or two definitive approaches. The No Free Lunch idea, however, still looms over everything we do.
Inspired by Tom Fawcett’s machine learning gallery (which unfortunately doesn;t seem to be online any more) we demonstrated this in our book, Fundamentals of Machine Learning for Predictive Data Analytics, using a series of artificial datasets and simple predictive models. The image below shows how we generated the first of these. We defined a simple feature space composed of two features, F1 and F2, and separated this feature space in a good area and a bad area using a straight line. Using this separation of the feature space we then generated an artificial dataset, shown below, where triangles belong to the bad category and crosses belong to the good category. (Just think of Santa in the North Pole getting stuck into some machine learning to figure out the good and bad lists at Christmas!)
Using this artificial dataset we can then train models of different types to see how ell they can capture the separation into good and bad categories (this is the real benefit of this simple two dimensional artificial example). The image below shows how a decision tree approach would manage with this data set.
We can see the characteristic step pattern of a decision tree in the thick black boundary line showing the model learned by the decision tree algorithm. This boundary perfectly separates the good and bad examples in the dataset, but does not quite match the actual underlying model that separates these two groups. This means that if we try to use this model for other data drawn for this underlying distribution it won’t generalise very well.
A logistic regression learning algorithm, however, suits this dataset really well. We see that in the image below where again the decision boundary learned by the algorithm from the dataset is shown as a think black line. This almost perfectly mirrors the true underlying distribution used to generate the dataset which means that this model will generalise very well to new examples from this same distribution.
Based on this example we might conclude that logistic regression models are always the way to go. A second simple example, however, shows that this is not the case. This time we generate an artificial dataset in which the good and bad examples are separated by a stair case pattern. The underlying pattern and the dataset generated are both shown below.
This time a model learned by a decision tree algorithm almost perfectly matches the true underlying distribution. It is exactly the characteristics that of the decision tree algorithm (the step shaped decision boundary) that allow it to perform well for this problem, however, that led to degraded performance in the previous example.
Similarly, the linear characteristic of the logistic regression approach which led to such good performance in the previous examples is a poor match for this problem and leads to degraded performance.
This is what the No Free Lunch idea is all about. Unfortunately in real problems, where we have many more than two dimensions, we can’t simply plot a dataset to understand the shape of a decision boundary and so pick the best matching algorithm. This is one of the reasons that evaluation is so important in machine learning, as the only way to determine what learning algorithm will best suit a particular problem is to try a set out and compare their performance. And that’s why there is no such things as a free lunch in machine learning. But then that’s also a large part of what keeps things interesting!