Ensemble learning algorithms like XGBoost or Random Forests are among the top-performing models in Kaggle competitions. How do they work?
Fundamental learning algorithms such as logistic regression or linear regression are often too simple to achieve adequate results for a machine learning problem. A possible alternative is to use neural networks, but they require a vast amount of training data, which is rarely available. Ensemble learning techniques can boost the performance of simple models even with a limited amount of data.
Imagine asking a person to guess how many jellybeans there are inside a big jar. A single person's answer is unlikely to be a precise estimate of the correct number. However, if we ask a thousand people the same question, the average answer will likely be close to the actual number. This phenomenon is called the wisdom of the crowd [1]. When dealing with complex estimation tasks, the crowd can be considerably more precise than an individual.
Ensemble learning algorithms take advantage of this simple principle by aggregating the predictions of a group of models, like regressors or classifiers. For an aggregation of classifiers, the ensemble model can simply pick the most common class among the predictions of the low-level classifiers. For a regression task, the ensemble can instead use the mean or the median of all the predictions, as in the sketch below.
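Here is a minimal sketch of both aggregation rules, using made-up predictions from three hypothetical models (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical binary predictions from three classifiers on five samples.
clf_preds = np.array([
    [0, 1, 1, 0, 1],   # classifier A
    [0, 1, 0, 0, 1],   # classifier B
    [1, 1, 1, 0, 0],   # classifier C
])

# Classification: majority vote, i.e. the most common class per sample.
majority_vote = (clf_preds.sum(axis=0) > clf_preds.shape[0] / 2).astype(int)
print(majority_vote)                  # [0 1 1 0 1]

# Hypothetical predictions from three regressors on three samples.
reg_preds = np.array([
    [2.1, 3.4, 0.9],   # regressor A
    [1.9, 3.1, 1.2],   # regressor B
    [2.3, 3.6, 0.7],   # regressor C
])

# Regression: average (or median) of the individual predictions.
print(reg_preds.mean(axis=0))         # [2.1   3.367 0.933]
print(np.median(reg_preds, axis=0))   # [2.1   3.4   0.9  ]
```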
By aggregating a large number of weak learners, i.e. classifiers or regressors that are only slightly better than random guessing, we can achieve remarkable results. Consider a binary classification task: by aggregating 1000 independent classifiers, each with an individual accuracy of 51%, we can build an ensemble that reaches roughly 75% accuracy [2].
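The figure follows from a simple binomial calculation: the majority vote is correct whenever at least half of the classifiers are correct. The sketch below reproduces it under the strong assumption that the classifiers' errors are independent (real models trained on the same data rarely are, so this is an upper bound on what to expect):

```python
from scipy.stats import binom

n_classifiers = 1000
p_correct = 0.51   # accuracy of each individual weak learner

# Probability that at least half of 1000 independent classifiers are correct,
# with the number of correct votes X ~ Binomial(1000, 0.51).
ensemble_accuracy = 1 - binom.cdf(n_classifiers // 2 - 1, n_classifiers, p_correct)
print(f"{ensemble_accuracy:.3f}")   # ~0.747, i.e. roughly 75%
```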
This is why ensemble algorithms are so often the winning solutions in machine learning competitions!
There are several techniques for building an ensemble learning algorithm. The principal ones are bagging, boosting, and stacking. In the following…