AdaBoost and Gradient Boosting in Machine Learning


Tree-based algorithms are a crucial aspect of Machine Learning. We can use the Decision Tree algorithm to automate the creation of flowcharts for making predictions, which can be either categorical or continuous numerical values. However, relying on a single decision tree can be risky, as it may be prone to overfitting. To overcome this issue, we use an ensemble approach: building multiple decision trees and combining their decisions. There are two types of ensemble approaches: Bagging and Boosting.

Key takeaways from this blog

After completing this blog, we will understand the following concepts:

  • What is Bagging?
  • What is Boosting?
  • Pseudo-code for boosting.
  • Hyperparameters for Boosting algorithms
  • Variants of boosting algorithms: AdaBoost and Gradient Boost, and their possible improvements.

So let’s begin our journey with a revision of the bagging concept.


What is Bagging?

Using the bootstrap approach, we create multiple copies of the original training dataset, fit a different decision tree on each copy, and then combine their predictions to form a single predictive model. All these decision trees are treated independently.

Random forest is a variation of Bagging where, instead of all the features, we use only a subsample of the features present in the data to create each decision tree on the bootstrapped dataset.


What is Boosting?

Boosting, similar to bagging in terms of outcome, is another approach where we ensemble the predictions made by all decision trees, but the process occurs sequentially: every decision tree is grown using the information from the previously grown trees. It does not involve bootstrapped data; instead, each tree fits a modified form of the original dataset, where the modification is based on the information present in the previously grown trees. So we can say that in boosting, trees are not independent.

Now that we know what boosting is, let’s understand its pseudo-code to grasp its working principle.

Pseudo Code of Boosting For Regression Trees

Let’s say we are trying to solve a regression problem using boosting. Suppose we want to grow M trees sequentially (one after the other). The function we want to fit is f(x), and we have n samples of supervised data, with input variables X and target variables y. Now, let’s define the step-wise process.

Step 1: Define the terms as:

# Initializing the function that we want to learn
fx = 0

# Initializing the learning residual for every sample. We will use this
# variable to keep track of what has already been learnt and what we
# want to learn next.
rk = yk # for k in [1, n] as we have n samples

Step 2: Fitting M DTs on our training dataset.

In the boosting process, we grow trees sequentially so that newer trees do not waste time re-learning the information that previously grown trees have already captured. This makes the overall process faster and smoother.

for i in range(0, M):
    # Restrict the growth of the tree by defining the max splits:
    # fit DT(i) with only d splits (d+1 terminal nodes) on (X, r)

    # As boosting is an ensemble process, we need to combine the
    # predictions with certain weightage values. That weightage is λ.
    # Update fx as:
    fx := fx + λ*DT(i)

    # As a property of boosting, we can reuse what previous DTs have
    # already learned, so we only need to learn what they failed to
    # capture. Update the learning residuals as:
    rk := rk - λ*DT(i)(Xk)
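The loop above can be turned into runnable form. Below is a minimal pure-Python sketch (not the article's code) that uses depth-1 regression trees (decision stumps) as DT(i), a toy 1-D dataset y = x², and the same fx/residual updates:

```python
# Toy 1-D regression data: y = x^2 on a grid of points
X = [i / 10 for i in range(-20, 21)]
y = [x * x for x in X]

def fit_stump(xs, r):
    """Fit a d = 1 tree (decision stump) on residuals r: pick the
    threshold minimizing the squared error of the left/right means."""
    best = None
    for t in xs:
        left = [ri for xi, ri in zip(xs, r) if xi <= t]
        right = [ri for xi, ri in zip(xs, r) if xi > t]
        if not right:            # all samples on one side: skip threshold
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + \
              sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

M, lam = 200, 0.1            # number of trees M and shrinkage λ
f = [0.0] * len(X)           # Step 1: fx = 0
r = list(y)                  # Step 1: residuals start as the targets
for _ in range(M):           # Step 2: fit M DTs sequentially
    stump = fit_stump(X, r)
    pred = [stump(x) for x in X]
    f = [fi + lam * p for fi, p in zip(f, pred)]   # fx := fx + λ*DT(i)
    r = [ri - lam * p for ri, p in zip(r, pred)]   # rk := rk - λ*DT(i)(Xk)

mse = sum((fi - yi) ** 2 for fi, yi in zip(f, y)) / len(y)
```

After 200 shrunken stumps, the training MSE drops far below the variance of y, even though each individual stump is a very weak learner.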

We generally use smaller trees in the boosting approach. This can be controlled using the “d” parameter in the above code. If the number of allowed splits is d, the number of terminal nodes (also called leaf nodes) will be d + 1. How?

How do nodes split in the tree algorithms of Machine Learning?

After every split, we break a parent node into two child nodes. So in the figure above, we have performed four splits: one root node (blue) and three decision nodes (orange), resulting in 5 terminal nodes (green).

At the final stage of training one DT, we need to check what it has already learned so that we can fit our next DT on the remaining unlearned part. In statistical learning, it is assumed that approaches that learn slowly perform well. Hence, the term λ is introduced as a hyperparameter that slows down the learning-residual update process, letting more diverse decision trees capture the unlearned patterns in the data.

Step 3: Ensembling all the DTs

# Final output of the boosted model:
fx = ∑ λ*DT(i)   # sum over i = 1 to M

Looking at this pseudo-code, the DTs formed are not independent, and the growth of any tree strongly depends on the evolution of the previously grown trees.

Hyperparameters of Boosting Approach

If we look into the pseudo-code of boosting, we have three tunable parameters that can affect the performance of our boosting algorithms. These are:

  • The value of M: States how many trees we want to grow sequentially. If we keep M very high, our model will quickly suffer from overfitting, which was not the case with Bagging. In Bagging, we grow as many trees as feasible because they are independent; but here, every tree learns information not learned by earlier trees, so with too many trees, the model starts learning the noise in the dataset and overfits. We can use cross-validation to select an appropriate value of M, or break the loop and return fx when the update in the learning residuals becomes insignificant compared to previous iterations.
  • The value of d: States how many splits are allowed for each Decision Tree. We know that the deeper the tree, the more complex its processing will be. As we are growing multiple trees in boosting, we must limit their growth to reduce complexity. Often d = 1 works well; this special case of a DT, where we split the node only once, is known as a decision stump.
  • The value of λ: States the shrinkage parameter, which controls the learning speed in the boosting approach. Typical values are 0.01 and 0.001, sometimes tuned per the problem’s demands. A very small value of λ requires a considerably large value of M for better performance.
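If you happen to use scikit-learn (an assumption on tooling; the discussion above is library-agnostic), these three hyperparameters map directly onto GradientBoostingRegressor's n_estimators (M), max_depth (which caps the splits d), and learning_rate (λ):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Toy data: the same y = x^2 shape used earlier
X = [[i / 10] for i in range(-20, 21)]
y = [row[0] ** 2 for row in X]

model = GradientBoostingRegressor(
    n_estimators=100,    # M: how many trees to grow sequentially
    max_depth=1,         # keeps each tree a decision stump (d = 1)
    learning_rate=0.1,   # λ: the shrinkage parameter
)
model.fit(X, y)
r2 = model.score(X, y)   # training-set R^2
```

Tuning n_estimators and learning_rate jointly (via cross-validation) mirrors the M-versus-λ trade-off described above.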

Algorithmic Variants of Boosting

There are three prevalent variants of boosting algorithms:

  1. Adaboost
  2. Gradient Boost
  3. XG Boost

In this article, we will keep our discussion limited to Adaboost and Gradient Boost. XG Boost will be covered in a separate blog.

Adaboost (Adaptive Boosting)

AdaBoost is a boosting algorithm, hence it works on the same principle discussed above. There are two significant updates over the basic variant of boosting:

  1. In AdaBoost, we first assign equal weightage to all the training samples while training the first Decision Tree. After evaluating that DT, we assign higher weightage to those samples on which the earlier DTs struggled to perform well, so the next DT focuses on them.


  2. Mostly, a decision stump is used as the weak learner in the AdaBoost variant, but there is flexibility to choose other learning algorithms, like SVM.
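The re-weighting step in point 1 can be sketched numerically. This is a toy, single-round illustration of an AdaBoost.M1-style update with made-up labels and predictions (not taken from the article):

```python
import math

# True labels and one weak learner's predictions (made-up numbers)
y    = [+1, +1, -1, -1, +1]
pred = [+1, -1, -1, -1, -1]
w    = [1 / len(y)] * len(y)       # start: equal weightage to all samples

# Weighted error rate of this weak learner
err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)
# alpha: how much say this learner gets in the final ensemble
alpha = 0.5 * math.log((1 - err) / err)
# Raise the weights of misclassified samples, lower the rest
w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
total = sum(w)
w = [wi / total for wi in w]       # renormalize so weights sum to 1
```

Samples 1 and 4 were misclassified, so after the update their weights exceed those of the correctly classified samples, and the next weak learner will concentrate on them.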

Gradient Boosting

This is another variant of the boosting algorithm, where we use the gradient descent algorithm to add the weak learners in a stage-wise manner such that the overall loss/error in prediction gets reduced. Let’s understand it in more detail.

How does Gradient Boosting work?

There are mainly three components of this algorithm:

  1. Selection of the loss function: We select an appropriate loss function per the problem statement, such as Mean Squared Error (MSE) for regression and Categorical Cross-Entropy for classification.
  2. Preparation of the weak learner for making predictions: In Gradient Boosting, the weak learners are Decision Trees, grown greedily by choosing the best split points. Decision stumps can be used here too, but DTs with 4 to 6 levels can give better performance. The idea is simple: we need to construct the DTs so that learning happens slowly.
  3. Addition of weak learners such that the loss gets reduced: In AdaBoost, we constructed DTs sequentially on the unlearned part of the data and added them after multiplying by a shrinkage parameter λ. This process is different in Gradient Boosting: we construct DTs sequentially and perform the addition only if it reduces the overall loss function. We know how traditional gradient descent works, where we vary a set of parameters to reduce the loss. In Gradient Boosting, instead of parameters, we have DTs: we add a tree by parametrizing it and then modify those parameters to decrease the overall loss.
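The link between residuals and gradients is easy to verify numerically. For squared-error loss, the negative gradient of the loss with respect to the current prediction equals the residual, which is why fitting trees to residuals is gradient descent in disguise. A quick check with made-up numbers:

```python
y_true, f_curr = 3.0, 2.2     # made-up target and current model output
eps = 1e-6

def loss(pred):
    """Squared-error loss 0.5 * (y - pred)^2."""
    return 0.5 * (y_true - pred) ** 2

# Numerical gradient of the loss w.r.t. the current prediction
grad = (loss(f_curr + eps) - loss(f_curr - eps)) / (2 * eps)
residual = y_true - f_curr
# -grad and residual agree: the next tree fits the negative gradient
```

For other loss functions the same recipe holds, except the "residual" the next tree fits becomes the negative gradient of that loss.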

We always add a fixed number of DTs, as boosting techniques are highly prone to overfitting. But at the same time, we need more accurate models, which demand a larger number of trees. Hence, there are certain improvements over the basic Gradient Boosting mechanism, so let’s look at them.

Improvements in basic Gradient boosting

Gradient boosting is a greedy algorithm that suffers from overfitting, and we know a technique that can solve this problem: yes, regularization. We penalize the update steps and improve performance by reducing overfitting. Apart from regularization, there are some other techniques as well, so let’s understand them.

Limiting the Tree

Our weak learners must understand the patterns in the data, but at the same time, we want to keep them weak. Hence, we need to control the construction of these weak learners, i.e., the trees. If we do not control their construction, fewer trees are required, but the learning will not be as accurate as an ensemble built from many constrained weak learners.

There can be multiple ways to do so:

  • Tree depth should be controlled: Deeper trees increase complexity, so shallow trees are preferred. 4–8 level trees show better results.
  • The number of trees should be controlled: Our model gets overfitted if we keep adding more trees. Hence we need to limit the number of additions allowed.
  • The number of samples per split: It’s a standard tree-pruning methodology where we split a node only when the number of instances it holds is greater than some threshold value.
  • Minimum improvement in loss: As we discussed, we add a tree if it improves the overall loss. But if the decrease in loss is insignificant or below a threshold value, we can restrict the addition.
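The last point (minimum improvement in loss) can be sketched with a toy stopping rule. Here the "weak learner" is simplified to predicting the mean residual (a deliberate simplification for brevity, not a real tree), so the focus is purely on the early-stopping check:

```python
y = [2.0, 4.0, 6.0, 8.0]        # made-up targets
f = [0.0] * len(y)              # current ensemble prediction
lam, tol = 0.5, 1e-4            # shrinkage and minimum required improvement
n_added = 0

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

for _ in range(1000):           # generous upper bound on learners
    r = [ti - fi for ti, fi in zip(y, f)]
    learner = sum(r) / len(r)   # "weak learner" = the mean residual
    candidate = [fi + lam * learner for fi in f]
    if mse(f, y) - mse(candidate, y) < tol:
        break                   # improvement too small: stop adding
    f = candidate
    n_added += 1
```

The loop stops after a handful of additions, once each new learner buys less than tol of loss reduction, instead of running to the fixed upper bound.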

Weighted updates

We add the prediction of every DT sequentially in a stage-wise manner. Instead of adding the DTs directly, we can slow down the learning by multiplying the predictions by shrinkage parameters. This allows us to add more trees, making the predictions more accurate.

Y = alpha*DT1 + beta*DT2 + gamma*DT3 + ... + error 

where alpha, beta and gamma are shrinkage parameters.

We prefer to keep these values small, such as 0.1 or 0.01.

Stochastic Gradient Boosting

If we recall the Bagging and random forest processes, we build different trees on different subsamples of the dataset so that the trees are uncorrelated. If we do the same in boosting, the variant is called Stochastic Gradient Boosting.
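In scikit-learn (again an assumption on tooling), this variant is enabled simply by setting subsample below 1.0, so each tree sees a random fraction of the rows:

```python
from sklearn.ensemble import GradientBoostingRegressor

X = [[i] for i in range(40)]
y = [2 * i + 1 for i in range(40)]      # made-up linear targets

# subsample < 1.0: every tree is fit on a random fraction of the rows,
# which is exactly the "stochastic" part of stochastic gradient boosting
model = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                                  subsample=0.5, random_state=0)
model.fit(X, y)
r2 = model.score(X, y)
```

Row subsampling both speeds up training and decorrelates the trees, which often improves generalization.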

Penalized Gradient Boosting

This additional constraint can be applied to parametrized trees along with the structural limits. Traditional DTs are not weak learners; we modify them into weak learners known as regression trees. The terminal nodes of these trees store numerical values, commonly referred to as weights. These weight values are regularized using either L1 or L2 regularization techniques. We will cover the implementation in the next blog.

Possible Interview Questions on Boosting

Boosting is a sophisticated and advanced algorithm, and knowledge about it is always a plus. The following questions can be framed on this topic:

  • What is Boosting, and how is it different from Bagging?
  • What is Ensemble learning, and how is boosting an ensemble approach?
  • Boosting suffers overfitting problems. Why? How can we solve this problem?
  • What is AdaBoost, and how is it different from standard boosting?
  • What is Gradient boosting?
  • What are the possible improvements in Gradient boosting?


In this article, we learned about one of the advanced algorithms in Machine Learning: Boosting. It is considered a building block for more sophisticated algorithms like XG Boost, the most used algorithm in Kaggle competitions, which gives accuracy almost comparable to neural networks. We saw the variants of boosting algorithms, AdaBoost and Gradient Boosting, and the possible improvements that can be made. We hope you enjoyed the article.

Next Blog: Introduction to Random Forest in Machine Learning

Enjoy learning, Enjoy algorithms!
