In the previous article we discussed the XGBoost algorithm and showed its implementation in pseudocode. In this article we are going to implement the algorithm in Python from scratch.
The provided code is a concise and lightweight implementation of the XGBoost algorithm (with only about 300 lines of code), intended to demonstrate its core functionality. As such, it is not optimized for speed or memory usage, and does not include the full spectrum of options provided by the XGBoost library (see https://xgboost.readthedocs.io/ for more details on the features of the library). More specifically:
- The code is written in pure Python, whereas the core of the XGBoost library is written in C++ (its Python classes are only thin wrappers over the C++ implementation).
- It does not include various optimizations that allow XGBoost to deal with huge amounts of data, such as weighted quantile sketch, out-of-core tree learning, and parallel and distributed processing of the data. These optimizations will be discussed in more detail in the next article in the series.
- The current implementation supports only regression and binary classification tasks, whereas the XGBoost library also supports multi-class classification and ranking problems.
- Our implementation supports only a small subset of the hyperparameters available in the XGBoost library, namely:
- n_estimators (default = 100): the number of regression trees in the ensemble (which is also the number of boosting iterations).
- max_depth (default = 6): the maximum depth (number of levels) of each tree.
- learning_rate (default = 0.3): the step size shrinkage applied to the trees.
- reg_lambda (default = 1): L2 regularization term applied to the weights of the leaves.
- gamma (default = 0): minimum loss reduction required to split a given node.
For consistency, I have kept the same hyperparameter names and default values as defined in the XGBoost library.
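To make the hyperparameters concrete, here is a minimal sketch of how they might appear in an estimator's constructor, and how `reg_lambda` and `gamma` enter the standard XGBoost split-gain formula. The class and function names here are illustrative, not taken from the actual implementation:

```python
class XGBModel:
    """Illustrative base estimator holding the supported hyperparameters."""

    def __init__(self, n_estimators=100, max_depth=6, learning_rate=0.3,
                 reg_lambda=1, gamma=0):
        # Same names and defaults as in the XGBoost library
        self.n_estimators = n_estimators    # number of boosting iterations
        self.max_depth = max_depth          # maximum depth of each tree
        self.learning_rate = learning_rate  # step size shrinkage (eta)
        self.reg_lambda = reg_lambda        # L2 regularization on leaf weights
        self.gamma = gamma                  # minimum loss reduction to split

    def split_gain(self, G_left, H_left, G_right, H_right):
        """Gain of a candidate split, given the sums of gradients (G) and
        Hessians (H) on each side. A split is kept only if the gain is
        positive, so gamma acts as a minimum required loss reduction."""
        lam = self.reg_lambda
        return 0.5 * (G_left**2 / (H_left + lam)
                      + G_right**2 / (H_right + lam)
                      - (G_left + G_right)**2 / (H_left + H_right + lam)) \
               - self.gamma
```

With `reg_lambda=0` and `gamma=0` the gain reduces to the unregularized form; increasing either value makes the tree more conservative about splitting.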