AdaBoost and Gradient Boosting in Machine Learning


Tree-based algorithms are a crucial aspect of Machine Learning. We can use the Decision Tree algorithm to automate the creation of flowcharts for making predictions, which can be either categorical or continuous numerical values. However, relying on a single decision tree can be risky, as it may be prone to overfitting. To overcome this issue, we use an ensemble approach: building multiple decision trees and combining their decisions. There are two types of ensemble approaches: Bagging and Boosting.

Key takeaways from this blog

After completing this blog, we will understand the following concepts:

  • What is Bagging?
  • What is Boosting?
  • Pseudo-code for boosting.
  • Hyperparameters for Boosting algorithms
  • Variants of boosting algorithms: AdaBoost and Gradient Boost, and their possible improvements.

So let’s begin our journey with a revision of the bagging concept.


What is Bagging?

Using the bootstrap approach, we create multiple copies of the original training dataset, fit a different decision tree on each copy, and then combine their predictions to form a single predictive model. All these decision trees are treated independently.

Random forest is a variation of Bagging where, instead of all the features, we use only a subsample of the features present in the data to create each decision tree on the bootstrapped dataset.


What is Boosting?

Boosting, similar to bagging in terms of outcome, is another approach where we ensemble the predictions made by all decision trees, but the process occurs sequentially: every decision tree is grown using the information from the previously grown trees. It does not involve bootstrapped data; instead, each tree fits a modified form of the original dataset, where the modification is based on the information present in the previously grown trees. So we can say that in boosting, trees are not independent.

Now that we know what boosting is, let’s understand its pseudo-code to grasp its working principle.

Pseudo Code of Boosting For Regression Trees

Let’s say we are trying to solve a regression problem using boosting. Suppose we want to grow M trees sequentially (one after the other). The function we want to fit is f(x), and we have n samples of supervised data, with input variables X and target variables y. Now, let’s define the step-wise process.

Step 1: Define the terms as:

# Initializing the function that we want to learn
fx = 0

# Initializing the learning residual for every sample. We will use this
# variable to keep track of what has already been learnt and what we
# want to learn next.
rk = yk # for k in [1, n] as we have n samples

Step 2: Fitting M DTs on our training dataset.

In the boosting process, we grow trees sequentially so that newer trees do not waste time re-learning the information that previously grown trees have already captured. This makes the overall process faster and smoother.

for i in range(0, M):
    # Restrict the growth of the tree by defining the max splits:
    # fit DT(i) with only d splits (d+1 terminal nodes) on (X, r)

    # As boosting is an ensemble process, we need to combine the
    # predictions with certain weightage values. That weightage is λ.
    # Update fx as:
    fx := fx + λ*DT(i)

    # As a property of boosting, we can reuse what previous DTs have
    # already learned, so we only need to learn what they failed to
    # capture. Update the learning residuals as:
    rk := rk - λ*DT(i)(Xk)
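The loop above can be turned into runnable form. Below is a minimal pure-Python sketch (not the article's code) that uses depth-1 regression trees (decision stumps) as DT(i), a toy 1-D dataset y = x², and the same fx/residual updates:

```python
# Toy 1-D regression data: y = x^2 on a grid of points
X = [i / 10 for i in range(-20, 21)]
y = [x * x for x in X]

def fit_stump(xs, r):
    """Fit a d = 1 tree (decision stump) on residuals r: pick the
    threshold minimizing the squared error of the left/right means."""
    best = None
    for t in xs:
        left = [ri for xi, ri in zip(xs, r) if xi <= t]
        right = [ri for xi, ri in zip(xs, r) if xi > t]
        if not right:            # all samples on one side: skip threshold
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + \
              sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

M, lam = 200, 0.1            # number of trees M and shrinkage λ
f = [0.0] * len(X)           # Step 1: fx = 0
r = list(y)                  # Step 1: residuals start as the targets
for _ in range(M):           # Step 2: fit M DTs sequentially
    stump = fit_stump(X, r)
    pred = [stump(x) for x in X]
    f = [fi + lam * p for fi, p in zip(f, pred)]   # fx := fx + λ*DT(i)
    r = [ri - lam * p for ri, p in zip(r, pred)]   # rk := rk - λ*DT(i)(Xk)

mse = sum((fi - yi) ** 2 for fi, yi in zip(f, y)) / len(y)
```

After 200 shrunken stumps, the training MSE drops far below the variance of y, even though each individual stump is a very weak learner.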

We generally use smaller trees in the boosting approach. This can be controlled using the “d” parameter in the above code. If the number of allowed splits is d, the number of terminal nodes (also called leaf nodes) will be d + 1. How?

How do nodes split in the tree algorithms of Machine Learning?

After every split, we break a parent node into two child nodes. So in the figure above, we have performed four splits: one root node (blue) and three decision nodes (orange), resulting in 5 terminal nodes (green).

At the final stage of training one DT, we need to check what it has already learned so that we can fit our next DT on the remaining unlearned part. In statistical learning, it is assumed that approaches that learn slowly perform well. Hence, the term λ is introduced as a hyperparameter that slows down the learning-residual update process, letting more diverse decision trees capture the unlearned patterns in the data.

Step 3: Ensembling all the DTs

# Final output of the boosted model:
fx = ∑ λ*DT(i)   # sum over i = 1 to M

Looking at this pseudo-code, the DTs formed are not independent, and the growth of any tree strongly depends on the evolution of the previously grown trees.

Hyperparameters of Boosting Approach

If we look into the pseudo-code of boosting, we have three tunable parameters that can affect the performance of our boosting algorithms. These are:

  • The value of M: States how many trees we want to grow sequentially. If we keep M very high, our model will quickly suffer from overfitting, which was not the case with Bagging. In Bagging, we grow as many trees as feasible because they are independent; but here, every tree learns information not learned by earlier trees, so with too many trees, the model starts learning the noise in the dataset and overfits. We can use cross-validation to select an appropriate value of M, or break the loop and return fx when the update in the learning residuals becomes insignificant compared to previous iterations.
  • The value of d: States how many splits are allowed for each Decision Tree. We know that the deeper the tree, the more complex its processing will be. As we are growing multiple trees in boosting, we must limit their growth to reduce complexity. Often d = 1 works well; this special case of a DT, where we split the node only once, is known as a decision stump.
  • The value of λ: States the shrinkage parameter, which controls the learning speed in the boosting approach. Typical values are 0.01 and 0.001, sometimes tuned per the problem’s demands. A very small value of λ requires a considerably large value of M for better performance.
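If you happen to use scikit-learn (an assumption on tooling; the discussion above is library-agnostic), these three hyperparameters map directly onto GradientBoostingRegressor's n_estimators (M), max_depth (which caps the splits d), and learning_rate (λ):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Toy data: the same y = x^2 shape used earlier
X = [[i / 10] for i in range(-20, 21)]
y = [row[0] ** 2 for row in X]

model = GradientBoostingRegressor(
    n_estimators=100,    # M: how many trees to grow sequentially
    max_depth=1,         # keeps each tree a decision stump (d = 1)
    learning_rate=0.1,   # λ: the shrinkage parameter
)
model.fit(X, y)
r2 = model.score(X, y)   # training-set R^2
```

Tuning n_estimators and learning_rate jointly (via cross-validation) mirrors the M-versus-λ trade-off described above.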

Algorithmic Variants of Boosting

There are three prevalent variants of boosting algorithms:

  1. Adaboost
  2. Gradient Boost
  3. XG Boost

In this article, we will keep our discussion limited to Adaboost and Gradient Boost. XG Boost will be covered in a separate blog.

Adaboost (Adaptive Boosting)

AdaBoost is a boosting algorithm, hence it works on the same principle discussed above. There are two significant updates over the basic variant of boosting:

  1. In AdaBoost, we first assign equal weightage to all the training samples while training the first Decision Tree. After evaluating that DT, we assign higher weightage to those samples on which the earlier DTs struggled to perform well, so the next DT focuses on them.


  2. Mostly, a decision stump is used as the weak learner in the AdaBoost variant, but there is flexibility to choose other learning algorithms, like SVM.
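The re-weighting step in point 1 can be sketched numerically. This is a toy, single-round illustration of an AdaBoost.M1-style update with made-up labels and predictions (not taken from the article):

```python
import math

# True labels and one weak learner's predictions (made-up numbers)
y    = [+1, +1, -1, -1, +1]
pred = [+1, -1, -1, -1, -1]
w    = [1 / len(y)] * len(y)       # start: equal weightage to all samples

# Weighted error rate of this weak learner
err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)
# alpha: how much say this learner gets in the final ensemble
alpha = 0.5 * math.log((1 - err) / err)
# Raise the weights of misclassified samples, lower the rest
w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
total = sum(w)
w = [wi / total for wi in w]       # renormalize so weights sum to 1
```

Samples 1 and 4 were misclassified, so after the update their weights exceed those of the correctly classified samples, and the next weak learner will concentrate on them.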

Gradient Boosting

This is another variant of the boosting algorithm, where we use the gradient descent algorithm to add the weak learners in a stage-wise manner such that the overall loss/error in prediction gets reduced. Let’s understand it in more detail.

How does Gradient Boosting work?

There are mainly three components of this algorithm:

  1. Selection of the loss function: We select an appropriate loss function per the problem statement, such as Mean Squared Error (MSE) for regression and Categorical Cross-Entropy for classification.
  2. Preparation of the weak learner for making predictions: In Gradient Boosting, the weak learners are Decision Trees, grown greedily by choosing the best split points. Decision stumps can be used here too, but DTs with 4 to 6 levels can give better performance. The idea is simple: we need to construct the DTs so that learning happens slowly.
  3. Addition of weak learners such that the loss gets reduced: In AdaBoost, we constructed DTs sequentially on the unlearned part of the data and added them after multiplying by a shrinkage parameter λ. This process is different in Gradient Boosting: we construct DTs sequentially and perform the addition only if it reduces the overall loss function. We know how traditional gradient descent works, where we vary a set of parameters to reduce the loss. In Gradient Boosting, instead of parameters, we have DTs: we add a tree by parametrizing it and then modify those parameters to decrease the overall loss.
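The link between residuals and gradients is easy to verify numerically. For squared-error loss, the negative gradient of the loss with respect to the current prediction equals the residual, which is why fitting trees to residuals is gradient descent in disguise. A quick check with made-up numbers:

```python
y_true, f_curr = 3.0, 2.2     # made-up target and current model output
eps = 1e-6

def loss(pred):
    """Squared-error loss 0.5 * (y - pred)^2."""
    return 0.5 * (y_true - pred) ** 2

# Numerical gradient of the loss w.r.t. the current prediction
grad = (loss(f_curr + eps) - loss(f_curr - eps)) / (2 * eps)
residual = y_true - f_curr
# -grad and residual agree: the next tree fits the negative gradient
```

For other loss functions the same recipe holds, except the "residual" the next tree fits becomes the negative gradient of that loss.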

We always add a fixed number of DTs, as boosting techniques are highly prone to overfitting. But at the same time, we need more accurate models, which demand a larger number of trees. Hence, there are certain improvements over the basic Gradient Boosting mechanism, so let’s look at them.

Improvements in basic Gradient boosting

Gradient boosting is a greedy algorithm that suffers from overfitting, and we know a technique that can solve this problem: yes, regularization. We penalize the update steps and improve performance by reducing overfitting. Apart from regularization, there are some other techniques as well, so let’s understand them.

Limiting the Tree

Our weak learners must understand the patterns in the data, but at the same time, we want to keep them weak. Hence, we need to control the construction of these weak learners, i.e., the trees. If we do not control their construction, fewer trees are required, but the learning will not be as accurate as an ensemble built from many constrained weak learners.

There can be multiple ways to do so:

  • Tree depth should be controlled: Deeper trees increase complexity, so shallow trees are preferred. 4–8 level trees show better results.
  • The number of trees should be controlled: Our model gets overfitted if we keep adding more trees. Hence we need to limit the number of additions allowed.
  • The number of samples per split: It’s a standard tree-pruning methodology where we split a node only when the number of instances it holds is greater than some threshold value.
  • Minimum improvement in loss: As we discussed, we add a tree if it improves the overall loss. But if the decrease in loss is insignificant or below a threshold value, we can restrict the addition.
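The last point (minimum improvement in loss) can be sketched with a toy stopping rule. Here the "weak learner" is simplified to predicting the mean residual (a deliberate simplification for brevity, not a real tree), so the focus is purely on the early-stopping check:

```python
y = [2.0, 4.0, 6.0, 8.0]        # made-up targets
f = [0.0] * len(y)              # current ensemble prediction
lam, tol = 0.5, 1e-4            # shrinkage and minimum required improvement
n_added = 0

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

for _ in range(1000):           # generous upper bound on learners
    r = [ti - fi for ti, fi in zip(y, f)]
    learner = sum(r) / len(r)   # "weak learner" = the mean residual
    candidate = [fi + lam * learner for fi in f]
    if mse(f, y) - mse(candidate, y) < tol:
        break                   # improvement too small: stop adding
    f = candidate
    n_added += 1
```

The loop stops after a handful of additions, once each new learner buys less than tol of loss reduction, instead of running to the fixed upper bound.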

Weighted updates

We add the prediction of every DT sequentially in a stage-wise manner. Instead of adding the DTs directly, we can slow down the learning by multiplying the predictions by shrinkage parameters. This allows us to add more trees, making the predictions more accurate.

Y = alpha*DT1 + beta*DT2 + gamma*DT3 + ... + error 

where alpha, beta and gamma are shrinkage parameters.

We prefer to keep these values small, such as 0.1 or 0.01.

Stochastic Gradient Boosting

If we recall the Bagging and random forest processes, we build different trees on different subsamples of the dataset so that the trees are uncorrelated. If we do the same in boosting, the variant is called Stochastic Gradient Boosting.
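In scikit-learn (again an assumption on tooling), this variant is enabled simply by setting subsample below 1.0, so each tree sees a random fraction of the rows:

```python
from sklearn.ensemble import GradientBoostingRegressor

X = [[i] for i in range(40)]
y = [2 * i + 1 for i in range(40)]      # made-up linear targets

# subsample < 1.0: every tree is fit on a random fraction of the rows,
# which is exactly the "stochastic" part of stochastic gradient boosting
model = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                                  subsample=0.5, random_state=0)
model.fit(X, y)
r2 = model.score(X, y)
```

Row subsampling both speeds up training and decorrelates the trees, which often improves generalization.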

Penalized Gradient Boosting

This additional constraint can be applied to parametrized trees along with the structural limits. Traditional DTs are not weak learners; we modify them into weak learners known as regression trees. The terminal nodes of these trees store numerical values, commonly referred to as weights. These weight values are regularized using either L1 or L2 regularization techniques. We will cover the implementation in the next blog.

Possible Interview Questions on Boosting

Boosting is a sophisticated and advanced algorithm, and knowledge about it is always a plus. The following questions can be framed on this topic:

  • What is Boosting, and how is it different from Bagging?
  • What is Ensemble learning, and how is boosting an ensemble approach?
  • Boosting suffers overfitting problems. Why? How can we solve this problem?
  • What is AdaBoost, and how is it different from standard boosting?
  • What is Gradient boosting?
  • What are the possible improvements in Gradient boosting?


In this article, we learned about one of the advanced algorithms in Machine Learning: Boosting. It is considered a building block for more sophisticated algorithms like XG Boost, the most used algorithm in Kaggle competitions, which gives accuracy almost comparable to neural networks. We saw the variants of boosting algorithms, AdaBoost and Gradient Boosting, and the possible improvements that can be made. We hope you enjoyed the article.

Next Blog: Introduction to Random Forest in Machine Learning

Enjoy learning, Enjoy algorithms!
