Based on the nature of the input we provide to a machine learning algorithm, machine learning can be classified into four major categories: Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, and Reinforcement Learning. The Support Vector Machine (SVM) belongs to the supervised learning category.
SVM is one of the most popular algorithms in machine learning and data science. Since its introduction in the 1990s, it has remained widely popular among practitioners. The idea behind the algorithm is very intuitive, and experts consider it one of the best “out-of-the-box” classifiers. In this article, we will build an understanding of SVMs from a beginner level to an expert level.
A hyperplane is a well-known concept in geometry that can be seen as the generalization of a plane to different dimensions. In 2D space it is a line, in 3D space it is a plane, and in p-dimensional space it is a (p-1)-dimensional subspace.
By its very nature, a hyperplane separates the space into two parts. Suppose the βs are the coefficients and X = [x1, x2, …, xn] is a vector in n-dimensional space, so the hyperplane is defined by β0 + β1x1 + … + βnxn = 0 (equation 1). In the figure below, the left side shows the R2 (real) space, where equation 1 is the equation of a line, and similarly, on the right side, equation 1 is the equation of a plane in R3 space. If a point satisfies equation 1, it lies on the line (in R2) or on the plane (in R3). Otherwise, it satisfies either β0 + β1x1 + … + βnxn > 0 (equation 2) or β0 + β1x1 + … + βnxn < 0 (equation 3), and that sign decides on which side of the line (or plane) the point lies.
In other words, a hyperplane can act as a binary classifier. Take the left side of the image above: if we can find the equation of the line (equation 1), we can segregate the two types of dots (red and green), and the classification problem is solved. Isn't it trivial to find the coefficients using linear regression or any other algorithm that estimates the coefficients of a line? But there is a catch: infinitely many such lines exist. See the image below. Which one should we choose?
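To make this concrete, here is a minimal sketch (with made-up coefficients and points, not taken from the figure) that evaluates the line equation for two points and uses the sign to decide the side:
import numpy as np

# Hypothetical line in R2: beta0 + beta1*x1 + beta2*x2 = 0
beta0 = -4.0
beta = np.array([1.0, 1.0])

points = np.array([[1.0, 1.5],    # 1.0 + 1.5 - 4 = -1.5  -> one side of the line
                   [3.0, 2.5]])   # 3.0 + 2.5 - 4 = +1.5  -> the other side

values = beta0 + points @ beta                  # equation 1 evaluated at each point
sides = np.where(values > 0, "red", "green")    # equation 2 vs. equation 3
print(values, sides)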
Here comes the concept of the Maximal Margin Classifier.
One natural choice is to select the hyperplane that is farthest from the training samples. We can calculate the perpendicular distance of every training sample from the hyperplane; the minimum of these distances on each side of the plane gives the margin. Let's say W1 and W2 are the margins from the green dots and the red dots, respectively. We want to maximize this margin.
Since W1 + W2 is constant and we want both W1 and W2 to be as large as possible, the optimal choice is W1 = W2 = W. The hyperplane placed at this distance is the maximal margin hyperplane. In layman's terms, the maximal margin hyperplane represents the mid-line of the widest slab we can insert between the two classes. So our overall objective is to tune the coefficient values (β0, β1, β2, …, βn) so that W gets maximized, while taking care of two constraints: the coefficients must be normalized (Σ βj² = 1) so that distances are well defined, and every training sample must lie on the correct side of the hyperplane at a distance of at least W.
Let's say X = [X1, X2]' is a test vector. To classify it, we plug these values into the hyperplane equation. The sign of the output decides the class of the test sample X (say, positive means the red class and negative means the green class). The magnitude of the distance of that sample from the hyperplane can be treated as a confidence score for the classification: the larger the distance, the more confident the model is.
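As a rough sketch of this idea in scikit-learn (with made-up toy points), SVC exposes the value of the hyperplane equation through decision_function; for a linear kernel, dividing it by the norm of the coefficients gives the perpendicular distance:
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: two small blobs (values chosen arbitrarily)
X = np.array([[1.0, 1.2], [1.5, 0.8], [1.2, 1.1],    # "green" class (label 0)
              [3.0, 3.2], [3.5, 2.8], [3.2, 3.1]])   # "red" class (label 1)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear').fit(X, y)

x_new = np.array([[2.9, 3.0]])
score = clf.decision_function(x_new)            # signed value of the hyperplane equation
distance = score / np.linalg.norm(clf.coef_)    # perpendicular distance acts as confidence
print(clf.predict(x_new), score, distance)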
The width 2W (W from class 1 and W from class 2) is rigid, which means we do not allow any training sample to be present inside this margin. Because of this hard or rigid nature, this margin is also called a Hard Margin.
While finding the maximal margin hyperplane, notice that only the samples at the minimum distance from the hyperplane contribute to its formation. As those samples are p-dimensional vectors, we call them Support Vectors: they support the formation of the hyperplane.
One interesting thing to note is that the maximal margin hyperplane depends only on the support vectors out of the whole set of training samples. Moving any other sample does not affect the hyperplane unless it crosses the dashed margin lines. This is also why Support Vector Machines tend to perform well even when the number of training samples is smaller than the number of features.
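A small experiment on arbitrary toy points illustrates this: after fitting a nearly hard-margin linear SVC, moving a sample that is not listed among the support vectors leaves the hyperplane essentially unchanged.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 0.5], [0.5, 0.8],
              [3.0, 3.0], [3.5, 2.5], [4.0, 3.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C behaves like a hard margin
print(clf.support_)                           # indices of the support vectors

# Move sample 5 (assumed not to be a support vector; verify with the printout above)
X_moved = X.copy()
X_moved[5] = [6.0, 5.0]
clf_moved = SVC(kernel='linear', C=1e6).fit(X_moved, y)
print(clf.coef_, clf_moved.coef_)             # the hyperplane coefficients barely change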
So far, we have discussed the case where the two classes are perfectly separable, and it was intuitive to separate them using the maximal margin hyperplane. But what if the training samples are not linearly separable? In that scenario, the maximal margin classifier cannot find a hyperplane that separates the two labels.
And what if the training samples contain outliers or noise? These factors will affect the estimation of the optimal (maximal margin) hyperplane. In the figure below, you can see how a single outlier shifts the optimal hyperplane. Without the outlier, the margin would be wider, and the model would be more confident about its predictions.
We have now seen the problems the maximal margin classifier can suffer from. But what if those shortcomings could be cured? What if we said: it can be beneficial to misclassify a few training samples in order to classify the remaining samples well.
Yes, you heard it right! The Support Vector Classifier does exactly that. It allows some samples to be present on the wrong side of the margin, or even on the wrong side of the hyperplane. In contrast, the maximal margin classifier has a hard margin and cannot tolerate even a single sample on the wrong side. In the Support Vector Classifier (SVC), the margin is soft, as it allows a few samples to be on the wrong side; hence it is also called the Soft Margin Classifier.
In the figure above, “1” has the softest margin, meaning a larger number of samples can be present on the wrong side of the margin, while “4” has the least soft margin, allowing the fewest samples on the wrong side. There is a trade-off here that the machine optimizes to reach a balanced solution.
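In scikit-learn, this trade-off is controlled by the regularization parameter C of SVC: a smaller C gives a softer margin (more violations tolerated), while a larger C gives a harder one. A minimal sketch on synthetic, slightly overlapping blobs (the settings below are arbitrary):
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two slightly overlapping classes
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1, 100]:                  # softer margin -> harder margin
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, len(clf.support_))           # softer margins keep more support vectors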
Now we can say that SVC has solved the problem of the Maximal margin classifier. But can we think of a situation where SVC also does not work?
There can be relationships among the features that cannot be separated by a hyperplane of the dimension discussed so far (if the feature space has p dimensions, the hyperplane has p-1 dimensions). There is an obvious decision boundary in the image below, but such samples cannot be separated by any linear decision boundary.
One of the solutions for this type of non-linear decision boundary is to increase the feature dimension and then separate the labels using a linear decision boundary in the enlarged space. For the above image, let's engineer one extra feature.
Now, in a 3-dimensional feature space, we can separate the labels using a 2-dimensional decision boundary.
Let's take another example where 1-dimensional data is converted into 2-dimensional space. In the image below, the 1D data cannot be separated by a linear decision boundary, but once we engineer a new feature X2 = X1², we can find one.
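A small sketch of this trick (with made-up 1D points): after adding X2 = X1² as a second feature, a plain linear SVC can separate the classes.
import numpy as np
from sklearn.svm import SVC

# 1D data: the positive class sits in the middle, so no single threshold separates it
X1 = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])

# Engineer the extra feature X2 = X1^2; in 2D the classes become linearly separable
X = np.hstack([X1, X1 ** 2])
clf = SVC(kernel='linear').fit(X, y)
print(clf.score(X, y))    # expected to reach 1.0 on this toy data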
So finding the right kind of function to increase the dimension is really important. But there are countless candidate transformations to explore. How will the machine find the new dimension? Here comes the rescuer: the Support Vector Machine.
To provide a definite answer to the above question, an extension of the Support Vector Classifier was introduced that enlarges the feature space in a specific way using “kernels.”
From the above two examples, we must have learned that finding relationships among features is crucial to getting the extended feature space. One obvious way to measure the relationship between two vectors is their dot product, ⟨Xi, Xj⟩ = Σk XikXjk, which captures linear relationships well. But we cannot rely on the dot product alone, as the relationship among features can be of a higher order. Hence, to generalize it, we replace the dot product with a function K(Xi, Xj):
Here, K is a generalized kernel function that quantifies the relationship (similarity) between samples in the feature space. If K is simply the dot product, it is called the linear kernel, and the SVM reduces to the Support Vector Classifier. We can also design our own kernel function as per the project's needs. Let's quickly explore some popular kernel functions.
The polynomial kernel, K(Xi, Xj) = (1 + Σk XikXjk)^d, with degree d > 1 gives more flexible decision boundaries than the linear kernel. Another non-linear kernel function is K(Xi, Xj) = exp(-γ Σk (Xik - Xjk)²).
This is called the Radial Basis Function (RBF) kernel, or the Gaussian kernel. The value of γ controls how far the influence of each training example reaches: a higher γ makes that influence more local, so only samples close to the decision boundary shape it, while a smaller γ lets more distant samples contribute as well.
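In scikit-learn the kernel is just a parameter of SVC. A quick sketch on a synthetic "circles" dataset (settings chosen only for illustration) shows how the non-linear kernels handle a boundary the linear kernel cannot, and how γ changes the locality of the RBF kernel:
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Concentric circles: not separable by any straight line in the original 2D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

models = [('linear', SVC(kernel='linear')),
          ('poly, d=2', SVC(kernel='poly', degree=2)),
          ('rbf, gamma=0.5', SVC(kernel='rbf', gamma=0.5)),
          ('rbf, gamma=5', SVC(kernel='rbf', gamma=5.0))]

for name, clf in models:
    print(name, clf.fit(X, y).score(X, y))   # training accuracy; the linear kernel lags far behind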
Too much theory, right? But one critical concept is still pending: the cost function. We discussed that the goal of this algorithm is to find the parameters that fulfil two objectives: the margin should be as wide as possible, and the training samples should be classified correctly (to the extent the soft margin allows).
This goal is different from our usual objectives, such as finding the parameters that minimize the error between the predicted and actual probability distributions. Since the objective is different, we need a new cost function designed especially for this case.
In our blog about famous loss and cost functions in machine learning, we discussed the different cost functions for classification problems, like binary cross-entropy, categorical cross-entropy, and hinge loss. In the Support Vector Classifier, we use the hinge loss to find the parameters of our hyperplane such that the margin becomes maximum and the predictions stay as accurate as possible. So let's explore this in detail. If we summarize our objective using a mathematical formulation, we converge to the optimization problem below:
Maximize M over (β0, β1, …, βn) (equation 1), subject to Σ βj² = 1 (equation 2), and yi(β0 + β1Xi1 + … + βnXin) ≥ M for every training sample i (equation 3). Here the βs are the coefficients of the hyperplane, yi ∈ {-1, +1} is the label of sample i, and M is the margin. Equation 2 is not really a constraint but a supporting condition for equation 3: if a point lies on the hyperplane, meaning β0 + β1Xi1 + … + βnXin = 0, then k(β0 + β1Xi1 + … + βnXin) = 0 for any k ≠ 0, so the coefficients must be normalized for the left-hand side of equation 3 to equal the perpendicular distance from the hyperplane. Equation 3 is the constraint of the maximal margin classifier: the perpendicular distance of every point Xi should be greater than or equal to the margin. Can you guess for which samples this distance is exactly equal to the margin? (The support vectors.) With some standard reformulation, these three equations can be summarized in the expression below:
This is known as the Hinge Loss: for a sample (Xi, yi), the loss is max(0, 1 - yi(β0 + β1Xi1 + … + βnXin)). It is widely used in the industry and is considered one of the reasons behind the success of SVMs across a wide variety of problems.
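As a tiny sketch (with made-up labels and raw outputs), the hinge loss is easy to compute directly in NumPy:
import numpy as np

def hinge_loss(y, f):
    # y: labels in {-1, +1}; f: raw outputs of the hyperplane equation
    return np.mean(np.maximum(0.0, 1.0 - y * f))

# Correct and confident, correct but inside the margin, and misclassified samples
y = np.array([+1, -1, +1])
f = np.array([2.5, -0.4, -1.0])
print(hinge_loss(y, f))    # only the last two samples contribute to the loss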
Too much theory for this blog. Let’s see SVM in real action now.
Here we are going to implement the Support Vector Classifier with a linear kernel function on the famous Iris dataset, which has three flower classes: Setosa, Versicolor, and Virginica. We want to build a model that takes petal length, petal width, sepal length, and sepal width as input and classifies each sample into one of the three flower classes.
Here we import the basic libraries: NumPy, Pandas, and Matplotlib.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Here we will import the dataset, which is openly available and comes with the sklearn library. After importing the data, we can perform fundamental data analysis by plotting a scatter matrix of the different input features, colored by the target variable. It can be observed that there are three flower classes, represented as [0, 1, 2].
from sklearn import datasets

iris = datasets.load_iris()   # loading the dataset
X = iris.data                 # input features
y = iris.target               # target labels

iris_dataframe = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                              columns=iris['feature_names'] + ['target'])

grr = pd.plotting.scatter_matrix(iris_dataframe, c=iris['target'],
                                 figsize=(15, 5), s=60, alpha=0.8)
plt.show()
Scaling is always preferable before building a classification model with a Support Vector Classifier: the decision boundary is determined by distances to the support vectors, so features measured on larger scales would otherwise dominate the margin.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

### Standardize the data
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
This is the most crucial step where we will be training our classifier. If we look at the implementation of SVC in the sklearn library:
sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False,
tol=0.001, cache_size=200, class_weight=None, verbose=False)
There are some important parameters that need to be tuned while building this model: C, which controls how soft the margin is; kernel, which selects the kernel function; and the kernel-specific parameters gamma and degree.
# Create the SVM model
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)

# Train on the standardized features
classifier.fit(X_train_std, y_train)

# Make the predictions
y_pred = classifier.predict(X_test_std)
Now we can evaluate the model by using the confusion matrix.
from sklearn.metrics import confusion_matrix, accuracy_score
from mlxtend.plotting import plot_confusion_matrix

print(accuracy_score(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
fig, ax = plot_confusion_matrix(conf_mat=cm, figsize=(6, 6), cmap=plt.cm.Greens)
plt.xlabel('Predictions', fontsize=15)
plt.ylabel('Actuals', fontsize=15)
plt.title('Confusion Matrix', fontsize=15)
plt.show()
The accuracy on the test data comes out to be around 97%.
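As an optional follow-up experiment (not part of the walkthrough above), we can swap in the RBF kernel discussed earlier and compare its test accuracy with the linear one:
# Try the RBF kernel on the same standardized data
classifier_rbf = SVC(kernel='rbf', gamma='scale', random_state=0)
classifier_rbf.fit(X_train_std, y_train)
print(accuracy_score(y_test, classifier_rbf.predict(X_test_std)))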
SVM is one of the most frequently asked topics in machine learning and data science interviews. If we explore platforms like Glassdoor and Quora for questions asked in interviews at bigger tech firms like Microsoft, we will find that many of them are related to SVM. Possible questions typically cover the ideas discussed here: support vectors, hard versus soft margins, the kernel trick, and the hinge loss.
In this article, we discussed an out-of-the-box classifier in Machine Learning, i.e., Support Vector Machine. We learned about hyperplanes, maximal margin, support vector classifiers, and support vector machines. We also learned about different kernel functions and then implemented our SVM model on the famous iris dataset. We hope you enjoyed the article.
Enjoy Learning! Enjoy Algorithms!