Tree-based algorithms are among the essential concepts in the Machine Learning domain. The well-known Decision Tree algorithm automates the process of building a flowchart of decisions, and based on this flowchart, we predict either categorical labels or continuous numerical values.

But relying on the prediction of a single decision tree can be risky, as decision trees easily suffer from overfitting. Hence, we adopt an ensemble approach to improve their predictions: we build multiple decision trees and then combine the decisions made by all the trees. This ensemble approach can be of two types, Bagging and Boosting. Let's revise Bagging first, and then we will look at Boosting in greater detail.

### Key takeaways from this blog

After completing this blog, we will understand the following concepts thoroughly:

• What is Bagging?
• What is Boosting?
• Pseudo-code for boosting.
• Hyperparameters for Boosting algorithms
• Variants of boosting algorithms: AdaBoost and Gradient Boost, and their possible improvements.

So let's begin our journey with the bagging concept revision.

### Bagging

We create multiple copies of the original training dataset using the bootstrap approach, fit different decision trees on these copies, and then combine their predictions to form a single predictive model. All these decision trees are treated independently.

Random forest is a variation of Bagging where, instead of all the features, we use only a random subsample of the features to build each decision tree on the bootstrapped dataset.
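To make the contrast concrete, here is a minimal sketch of Bagging versus a random forest using scikit-learn (assumed available); the dataset and hyperparameter values are illustrative, not tuned:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Bagging: each tree is fit on a bootstrap copy of the data and may
# consider ALL features at every split (default base learner is a tree).
bagging = BaggingRegressor(n_estimators=50, random_state=0).fit(X, y)

# Random forest: same bootstrapping, but each split considers only a
# random subsample of the features (here, sqrt of the feature count).
forest = RandomForestRegressor(
    n_estimators=50, max_features="sqrt", random_state=0
).fit(X, y)

print(bagging.score(X, y), forest.score(X, y))
```

Both models average their trees' predictions; the feature subsampling in the forest is what decorrelates its trees.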

### Boosting

Boosting is similar to Bagging in terms of the outcome: we ensemble the predictions made by all the decision trees, but the process happens sequentially. Every decision tree is grown using the information of the previously grown trees. It does not involve bootstrapped data; instead, each tree is fitted on a modified form of the original dataset, where the modification is based on the information in the previously grown trees. So we can say that in boosting, the trees are not independent.

Now, as we know what boosting is, let's understand its pseudo code to grasp its working principle.

## Pseudo Code of Boosting For Regression Trees

Let's say we are trying to solve a regression problem using boosting. Suppose we want to grow M trees sequentially (one after the other). The function we want to fit is f(x), and for that we have n samples of supervised data, with input variables X and target variable y. Now, let's define the step-wise process.

### Step 1: Define the terms

```
# Initializing the function that we want to learn
fx = 0

# Initializing the learning residual for every sample. We will use this
# variable to keep track of what has already been learnt and what we
# want to learn next.
rk = Yk   # for k in [1, n], as we have n samples
```

### Step 2: Fitting M DTs on our training dataset.

In the boosting process, we grow trees sequentially so that newer trees do not waste effort re-learning the information that previously grown trees have already captured. This makes the overall process faster and smoother.

```
for i in range(0, M):
    # Restrict the growth of the tree by defining the max number of splits
    fit DT(i) with only d splits (d+1 terminal nodes) on (X, r)

    # As boosting is an ensemble process, we need to combine the
    # predictions with a certain weightage. That weightage is λ.
    update fx as: fx := fx + λ*DT(i)

    # As a property of boosting, we reuse what previous DTs have already
    # learned, so each new DT only learns what they failed to capture.
    update the learning residuals as: rk := rk - λ*DT(i)(Xk)
```

We generally use smaller trees in the boosting approach, which can be controlled using the "d" parameter in the code above. If the number of allowed splits is d, the number of terminal nodes (also called leaf nodes) will be d + 1. How? Every split breaks one parent node into two child nodes, so each split adds exactly one terminal node. For example, four splits (one at the root node and three at decision nodes) result in 5 terminal nodes.

After training one DT, we need to check what it has already learned so that we can fit the next DT on whatever remains to be learned. In statistical learning, it is assumed that approaches that learn slowly generally perform well. Hence, the term λ is introduced as a hyperparameter: it slows down the residual update process so that more diverse decision trees can capture the parts of the data that remain unlearned.

### Step 3: Ensembling all the DTs

```
# Final output of the boosted model:
fx = ∑ λ*DT(i)
```

Looking at this pseudo code, we can say that the DTs formed are not independent, and the growth of any tree strongly depends on the previously grown trees.
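The three steps above can be turned into runnable Python using scikit-learn regression trees; this is only a sketch of the pseudo code, with the toy sine dataset and the values of M, d, and λ (here `lam`) chosen for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

M, d, lam = 100, 1, 0.1          # number of trees, splits per tree, shrinkage

fx = np.zeros_like(y)            # Step 1: f(x) = 0
r = y.copy()                     # Step 1: residuals start as the targets
trees = []

for _ in range(M):               # Step 2: fit each tree to the current residuals
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, r)
    trees.append(tree)
    pred = tree.predict(X)
    fx += lam * pred             # update the ensemble
    r -= lam * pred              # update what is left to learn

def predict(X_new):
    # Step 3: the final model is the shrunken sum over all the trees
    return sum(lam * t.predict(X_new) for t in trees)
```

Note how `max_leaf_nodes=d + 1` enforces the d-splits limit discussed above; with d = 1, each tree is a decision stump.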

## Hyperparameters of Boosting Approach

If we look into the pseudo-code of boosting, we have three tunable parameters that can affect the performance of our boosting algorithms. These are:

• The value of M: It states how many trees we want to grow sequentially. If we keep M very high, our model will easily suffer from overfitting, which was not the case with Bagging: there we grow as many trees as feasible, since they are all independent. But in boosting, every tree learns information not learned by the earlier trees, so with too many trees, the machine starts learning the noise present in the dataset and overfits. We can use cross-validation to select an appropriate value of M: if the update in the learning residuals is insignificant compared to the previous iterations, we can break the loop and return fx.
```
Update the learning residuals as, rk := rk - λ*DT(i)(Xk)
```
• The value of d: It states how many splits are allowed for each Decision Tree. The deeper the tree, the more complex and expensive its processing becomes; as we grow multiple trees in boosting, we must limit their growth to keep the complexity manageable. Often d = 1 works well. This special case of a DT is known as a decision stump, where we split a node only once.
• The value of λ: It is the shrinkage parameter, which controls the speed of learning in the boosting approach. Typical values are 0.01 and 0.001, sometimes tuned as the problem demands. A very small value of λ requires a considerably large value of M for good performance.
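These three hyperparameters can be tuned jointly with cross-validation. A hedged sketch using scikit-learn's `GradientBoostingRegressor`, whose parameters roughly correspond to the symbols above (M → `n_estimators`, d → `max_depth`, λ → `learning_rate`); the grid values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={
        "n_estimators": [50, 100],     # M: number of trees
        "max_depth": [1, 2],           # controls splits per tree (d)
        "learning_rate": [0.1, 0.01],  # shrinkage (λ)
    },
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

Each fold evaluates one combination on held-out data, which is exactly the overfitting check the M discussion above calls for.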

## Algorithmic Variants of Boosting

There are three prevalent variants of boosting algorithms:

1. AdaBoost
2. Gradient Boost
3. XG Boost

In this article, we will keep our discussion limited to Adaboost and Gradient Boost. XG Boost will be covered in a separate blog.

### AdaBoost (Adaptive Boosting)

AdaBoost is a boosting algorithm; hence, it works on the same principle discussed above. There are two significant updates over the native variant of boosting:

1. In AdaBoost, we first assign equal weightage to all the training samples while training the first Decision Tree. We then evaluate that DT and assign higher weightage to the samples on which the earlier DTs struggled to perform well, so the next tree focuses on them.
2. Mostly, a decision stump is used as the weak learner in AdaBoost, but there is flexibility to choose a different learning algorithm, like SVM.
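A minimal AdaBoost sketch with scikit-learn (the dataset is synthetic and the tree count illustrative); its default base learner is a decision stump, matching the discussion above, and the `estimator` argument can swap in a different weak learner:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Default base learner: a depth-1 decision tree (a decision stump).
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))
```

Internally, each round reweights the training samples so that the next stump concentrates on the currently misclassified ones.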

### Gradient Boosting

Gradient Boosting is another variant of the boosting algorithm, where we use the gradient descent algorithm to add the weak learners in a stage-wise manner such that the overall loss/error in prediction gets reduced. Let's understand it in more detail.

#### How does Gradient Boosting work?

There are mainly three components of this algorithm,

1. Selection of the loss function: We select an appropriate loss function as per the problem statement. Standard choices include Mean Squared Error (MSE) for regression and Categorical Cross-Entropy for classification.
2. Preparation of the weak learner for making predictions: In Gradient Boosting, the weak learners are Decision Trees. We grow DTs greedily by choosing the best split points. Decision stumps can be used here as well, but DTs with 4 to 6 levels can give better performance. The idea is simple: we construct the DTs so that learning happens slowly.
3. Addition of weak learners such that the loss gets reduced: In AdaBoost, we constructed DTs sequentially on the unlearned part of the data and added them after multiplying by a shrinkage parameter λ. This process is different in Gradient Boosting: we construct DTs sequentially and add one only if it reduces the overall loss. In traditional gradient descent, we vary a set of parameters to reduce the loss; in Gradient Boosting, instead of parameters, we have DTs. We add a tree by parametrizing it and then modify these parameters to decrease the overall loss.

We add only a fixed number of DTs, as boosting techniques are highly prone to overfitting. But at the same time, more accurate models demand a larger number of trees. Hence, there are certain improvements over the basic Gradient Boosting mechanism, so let's look at them.

## Improvements in basic Gradient boosting

Gradient boosting is a greedy algorithm and hence suffers from overfitting, and we already know a technique that can solve this problem: regularization. We penalize the update steps to reduce overfitting and improve performance. Apart from regularization, there are some other techniques as well, so let's understand them.

### Limiting the Tree

Our weak learners must capture the patterns in the data, but at the same time, we want to keep them weak. Hence, we need to control the construction of these weak learners, i.e., the trees. If we do not control their construction, fewer trees will be needed, but the learning will not be as accurate as with a large number of weak learners.

There can be multiple ways to do so:

• Tree depth should be controlled: Deeper trees increase complexity, so shallow trees are preferred. 4–8 level trees show better results.
• The number of trees should be controlled: If we keep on adding more and more trees, our model gets overfitted. Hence we need to limit the number of additions allowed.
• The number of samples per split: It's a standard tree pruning methodology where we decide to split a node only when we have the total number of instances greater than some threshold value.
• Minimum improvement in loss: As we discussed, we add a tree if it improves the overall loss. But if the decrease in loss is insignificant or below a threshold value, we can restrict the addition.
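The four limiting strategies above map roughly onto parameters of scikit-learn's `GradientBoostingRegressor`; this is a sketch with illustrative values, not an exhaustive mapping:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

model = GradientBoostingRegressor(
    max_depth=4,            # control the tree depth
    n_estimators=100,       # control the number of trees
    min_samples_split=10,   # samples a node needs before it may split
    n_iter_no_change=5,     # stop adding trees when the held-out loss
    tol=1e-4,               # stops improving by more than `tol`
    random_state=0,
).fit(X, y)

# With early stopping, fewer than n_estimators trees may actually be fit.
print(model.n_estimators_)
```

Setting `n_iter_no_change` activates early stopping on an internal validation split, which is the "minimum improvement in loss" criterion in practice.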

### Shrinkage (Weighted Updates)

We add the prediction of every DT sequentially, in a stage-wise manner. Instead of adding the DTs directly, we can slow down the learning by multiplying their predictions with shrinkage parameters. This allows us to add more trees, making the predictions more accurate.

```
Y = alpha*DT1 + beta*DT2 + gamma*DT3 + ... + error

where alpha, beta and gamma are shrinkage parameters.
```

We prefer to keep these values small, such as 0.1 or 0.01.

### Stochastic Gradient Boosting

If we recall the process of Bagging and random forest, we build different trees on different subsamples of the dataset so that the trees are uncorrelated. If we do the same thing in boosting, the variant is called Stochastic Gradient Boosting.
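In scikit-learn, this corresponds to the `subsample` parameter of `GradientBoostingRegressor`: a value below 1.0 trains each tree on a random fraction of the rows (the value here is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

# Each boosting stage sees a random 50% of the training rows.
sgb = GradientBoostingRegressor(subsample=0.5, random_state=0).fit(X, y)
print(sgb.score(X, y))
```

Row subsampling decorrelates the trees much as bootstrapping does in Bagging, and often also speeds up training.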

### Penalized Learning

This additional constraint can be applied to parametrized trees along with the structural limits. Traditional DTs are not weak learners; we modify them into weak learners known as regression trees, whose terminal nodes store numerical values commonly referred to as weights. These weight values are regularized using either L1 or L2 regularization techniques. We will cover the implementation in the next blog.

## Possible Interview Questions on Boosting

Boosting is a sophisticated and advanced algorithm, and knowledge of it is always a plus. The following questions can be framed on this topic:

• What is Boosting, and how is it different from Bagging?
• What is Ensemble learning, and how is boosting an ensemble approach?
• Boosting suffers overfitting problems. Why? How can we solve this problem?
• What is AdaBoost, and how is it different from standard boosting?