Support Vector Machines: An "out of the box" classifier


SVM is one of the most popular algorithms in the domains of machine learning and data science. Since its introduction in the 1990s, it has been widely popular among machine learning experts. The idea behind the algorithm is very intuitive, and experts consider it one of the best "out of the box" classifiers. In this article, we will try to develop an understanding of SVMs from a beginner level to an expert level.

Key takeaways from this blog would be,

  1. An explanation of the Maximal Margin Classifier and its relation to Support Vector Classifiers and Support Vector Machines.
  2. We often loosely refer to the maximal margin classifier, the support vector classifier, and the support vector machine all as "support vector machines." After this blog, that confusion will be gone.
  3. An intuitive idea about the hyperplane, support vectors, margin, hard margin, and soft margin.
  4. Limitations of the support vector classifier and how SVMs overcome them.
  5. The intuitive idea behind the hinge loss function.
  6. A basic implementation in Python using Scikit-learn.

There is a lot to learn, so let’s start without any further delay. Let’s define the Hyperplane first.

Hyperplane

Hyperplanes are a popular concept in geometry and can be considered a generalization of the plane to different dimensions. In 2D space, a hyperplane is a line; in 3D space, it is a plane; and in p-dimensional space, it is a (p-1)-dimensional subspace.

By its fundamental nature, a hyperplane separates the space into two parts. Suppose the βs are the coefficients and X = [x1, x2, …, xn] is a vector in n-dimensional space. In the figure below, the left side shows R2 (real 2D) space, where equation 1 (β0 + β1x1 + … + βnxn = 0) is the linear equation of a line; similarly, on the right side, equation 1 is the equation of a plane in R3 space. If a point satisfies equation 1, it lies on the line (in R2) or on the plane (in R3). Otherwise, it satisfies either equation 2 or equation 3 (the same expression being greater than zero or less than zero), and that decides on which side of the line (or plane) the point lies.

Hyperplane idea

So basically, we can say that a hyperplane can act as a binary classifier. Taking the example of the above image, on the left side, if we can find the equation of the line (equation 1), we will be able to segregate the two types of dots (red and green), and the classification problem will be solved. Isn't it a trivial task to find the coefficients of that line using linear regression or any other algorithm that can estimate them?

But there is a catch! There can be infinitely many such lines. See the image below. Which one do we have to choose?

Possibilities of hyperplane

Here comes the concept of the Maximal Margin Classifier.

Maximal Margin Classifier

One natural choice is to select the hyperplane that is farthest from the training samples. We can calculate the perpendicular distance of every training sample on either side of the plane; the minimum of these distances on each side of the plane is the margin. Let's say W1 and W2 are the margins from the green dots and the red dots, respectively. We want this margin to be as large as possible.
Now, W1 + W2 is constant for a given slab, and we want both W1 and W2 to be as large as possible, so W1 = W2 = W is the optimal choice. The hyperplane placed at this distance is the maximal margin hyperplane. In layman's terms, the maximal margin hyperplane is the mid-line of the widest slab that we can insert between the two classes. So our overall objective is to fine-tune the coefficient values (β0, β1, β2, …, βn) so that W gets maximized. But two constraints should be taken care of:

  1. All training samples should be present on the correct side of the plane.
  2. Distance between the maximal margin hyperplane and the training samples should always be greater than or equal to W (minimum margin distance).

While classifying a test sample, let's say X = [X1, X2]' is the test vector. We put these values into the hyperplane equation, and the sign of the output decides the class of that test sample X (say, if positive, it lies in the red class, and if negative, in the green class).

The magnitude of the distance of that sample from the hyperplane can be treated as a confidence score for the classification: the larger the distance, the more confident the model is.
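To make this concrete, here is a minimal Python sketch with made-up coefficients (purely for illustration, not learned values) showing how the sign of the hyperplane equation gives the class and its magnitude gives a confidence score:

import numpy as np

# A made-up hyperplane, purely for illustration (not learned values)
beta_0 = -1.0                                   # intercept (β0)
beta = np.array([2.0, 3.0])                     # coefficients [β1, β2]

def classify(X):
    f = beta_0 + np.dot(beta, X)                # value of the hyperplane equation at X
    label = 'red' if f > 0 else 'green'         # the sign decides the side of the hyperplane
    confidence = abs(f) / np.linalg.norm(beta)  # perpendicular distance from the hyperplane
    return label, confidence

print(classify(np.array([1.5, 0.5])))           # e.g. ('red', ~0.97)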

The slab of width 2W (W from class 1 and W from class 2) is rigid, which means we do not allow even a single training sample to be present inside this margin. Because of this hard or rigid nature, we also call this margin a Hard Margin.

Support Vectors

While finding the maximal margin hyperplane, we may have noticed that the samples at the minimum distance from the hyperplane are the only samples contributing to its formation. As those samples are p-dimensional vectors, we call them Support Vectors, since they support the formation of the hyperplane.

Support vectors

One interesting thing to note here is that, out of the whole bunch of training samples, the maximal margin hyperplane depends only on the support vectors. Moving the other samples does not affect the hyperplane unless they cross the dashed margin lines. This is the main reason why Support Vector Machines perform well even when the number of samples in the training data is smaller than the number of features.
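As a quick illustration (a hedged sketch on a tiny toy dataset, assuming scikit-learn, which we also use in the implementation section later), a fitted model exposes exactly these samples through its support_vectors_ attribute:

import numpy as np
from sklearn.svm import SVC

# Two well-separated toy clusters; only the points nearest the boundary matter
X = np.array([[1, 1], [2, 2], [2, 0], [8, 8], [9, 9], [9, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)   # the samples that define the hyperplane
print(clf.n_support_)         # number of support vectors per class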

The Non-Separable case

So far, we have discussed the case where the two classes were perfectly separable, and it was intuitive to separate them using maximal margin hyperplane.

But what if the training samples are not easily separable?

In this scenario, a maximal margin classifier will not classify or separate the two labels.

Non separable case

And what if the training samples that are barely separable also contain some outliers or noise?

These factors definitely affect the estimation of the optimal (maximal margin) hyperplane. In the figure below, you can see how a single outlier can shift the optimal hyperplane. Without the outlier, the margin would be larger, and the model would be more confident about its predictions.

Outliers effect

Support Vector Classifier

We have seen the problems from which the maximal margin classifier can suffer. But what if we say that those shortcomings can be cured? What if we say that,

It could be beneficial to misclassify some of the training samples in order to classify the remaining samples perfectly.

Yes, you heard it right! The Support Vector Classifier does exactly this. It allows some samples to be on the wrong side of the margin, or even on the wrong side of the hyperplane. In the maximal margin classifier, on the other hand, the margin was hard and could not allow even a single sample on the wrong side of the margin.

In the case of the Support Vector Classifier (SVC), the margin is soft, as it allows a few samples to be present on the wrong side. Hence, it is also called the Soft Margin Classifier.

Soft margin classifier

Source: Introduction to Statistical Learning

In the above figure, "1" has the softest margin, meaning a larger number of samples are allowed on the wrong side of the margin, while "4" has the least soft margin, allowing the fewest samples on the wrong side. There is a trade-off here that the machine tries to optimize to reach an optimal setting.
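In scikit-learn, this softness is controlled through the regularization parameter C (discussed again in the implementation section). A rough sketch of the idea:

from sklearn.svm import SVC

# Smaller C -> softer margin (more violations tolerated), roughly like "1" above
softer_svc = SVC(kernel='linear', C=0.01)
# Larger C -> harder margin (fewer violations tolerated), roughly like "4" above
harder_svc = SVC(kernel='linear', C=100.0)

Both models can then be trained with .fit(X, y) as usual; only the tolerance for margin violations differs.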

Now we can say that SVC has solved the problem of the Maximal margin classifier. But can we think of a situation where SVC also does not work?

The case of Non-Linear Boundaries

There can be dependencies among features that cannot be separated by hyperplanes of the dimension discussed so far (if the features live in p dimensions, the hyperplane has p-1 dimensions). There is an obvious decision boundary in the image below, but such samples are not separable by a linear decision boundary.

Non linear decision boundary

One solution for this type of non-linear decision boundary is to increase the feature dimension and then separate the labels using a linear decision boundary in the enlarged space. For the above image, let's engineer one extra feature.

New feature engineered

Now, in a 3-dimensional feature space, we can separate the labels using a 2-dimensional decision boundary.

Circle in 3D

Let's take another example where 1-dimensional data is converted into 2-dimensional space. In the image below, the 1D data cannot be separated by a linear decision boundary, but when we engineer a new feature X2 = X1², we can find one (a quick sketch of this trick in code follows the figure below).

Parabolic case
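Here is a hedged Python sketch of this trick with made-up toy data: the 1-D points cannot be split by a single threshold, but after adding the engineered feature X2 = X1², a linear boundary separates them perfectly.

import numpy as np
from sklearn.svm import SVC

X1 = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])   # original 1-D feature
y = np.array([1, 1, 0, 0, 0, 1, 1])                      # outer points vs inner points

X = np.column_stack([X1, X1 ** 2])                       # engineered feature X2 = X1^2

clf = SVC(kernel='linear')
clf.fit(X, y)
print(clf.score(X, y))                                   # 1.0 -> now linearly separable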

So finding the right kind of function to increase the dimension is really important. But there are countless possible dimensions that could be explored while searching for it. How will the machine find the new dimension?

There comes the rescuer, Support Vector Machine.

Support Vector Machines

To provide a definite answer to the above question, an extension to the concept of Support Vector Classifier is introduced that enlarges the feature space in a specific way using “kernels.”

From the above two examples, we can see that finding relationships among features is crucial for building the extended feature space. To quantify the relationship between two vectors, one obvious choice is their dot product, which captures linear relationships well. But we cannot rely on the dot product alone, as the relationship among features can be higher-dimensional. Hence, to generalize it, we write it as

Kernel function

Here K is a generalized kernel function that quantifies the similarity between two samples in the feature space. If K is the plain dot product, it is called the Linear Kernel, and the SVM reduces to the Support Vector Classifier. We can also design our own kernel function as per the project's needs. Let's quickly explore some popular kernel functions.
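Before listing them, it is worth noting (following the standard presentation in Introduction to Statistical Learning) that the resulting classifier can be evaluated using only kernel values computed against the support vectors:

f(x) = β0 + Σ αi K(x, xi), where the sum runs over the support vectors and the αi are coefficients learned during training.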

Linear Kernel Function

Linear Kernel function

Polynomial Kernel Function

Polynomial Kernel function

A polynomial kernel function with d > 1 gives more flexible decision boundaries than the linear kernel function. Another non-linear kernel function is,

Radial Kernel Function

RBF

This is also called the Radial Basis Function (RBF) kernel or the Gaussian kernel. The value of γ controls how far the influence of each training example reaches: a higher γ means only nearby samples influence the decision boundary, while a smaller γ lets more distant samples influence it as well.
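For concreteness, here is a hedged Python sketch of the three kernels discussed above, written with the parameterization scikit-learn uses (γ for the coefficient, r for the constant term, d for the degree); the exact constants in the formulas above may be written slightly differently:

import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)                            # plain dot product

def polynomial_kernel(x, z, gamma=1.0, r=0.0, d=3):
    return (gamma * np.dot(x, z) + r) ** d         # (γ x·z + r)^d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))   # exp(-γ ||x - z||²)

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))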

gamma effect

Too much theory, right? But one critical concept is still pending: "The Cost Function". We discussed that the goal of this algorithm is to find the parameters such that two objectives are fulfilled,

  1. Penalize the wrong predictions.
  2. Maximize the margin, or in other words, penalize the less confident predictions.

This goal is different from our usual objective of finding parameters that minimize the error between the predicted and actual probability distributions. As the objective is different, we need a new cost function especially for this case.

Cost Function For Support Vector Machine

In our blog about famous loss and cost functions in machine learning, we discussed different cost functions for classification problems, like binary cross-entropy, categorical cross-entropy, and hinge loss. In the support vector classifier, we use the hinge loss cost function to find the parameters of our hyperplane such that the margin becomes maximal and the predictions become as accurate as possible. So let's explore this in detail.

If we summarize our objective using the mathematical formulation, we can converge to something like the equation below,

mathematics behind hinge loss

In equation 1, the βs are the coefficients of the hyperplane. Equation 2 is not really a constraint but a supportive normalization for equation 3: if a point lies on the hyperplane, i.e., β0 + β1Xi1 + … + βnXin = 0, then k(β0 + β1Xi1 + … + βnXin) = 0 for any k ≠ 0, so without the normalization the coefficients (and hence the distance) would not be uniquely defined. The third line is the constraint of the maximal margin classifier: the perpendicular distance of any point Xi from the hyperplane should be greater than or equal to the margin. Can you guess for which samples this distance will be exactly equal to the margin?
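In symbols, following the standard formulation (as in Introduction to Statistical Learning), the three conditions can be written roughly as:

  1. Maximize the margin W over the coefficients β0, β1, …, βn.
  2. Subject to the normalization Σj βj² = 1 (which makes the perpendicular distance well defined).
  3. And yi(β0 + β1Xi1 + … + βnXin) ≥ W for every training sample i, where yi = +1 or -1 encodes the class.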

Using some further mathematical manipulation, the above three conditions can be summarized in the equation below,

shortening of above 3 equations

This is known as the Hinge Loss. It is widely used in the industry and is considered one of the reasons for the success of SVMs across a wide variety of problems.
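For reference, the soft-margin objective is commonly written as minimizing Σi max(0, 1 - yi·f(Xi)) + λ Σj βj², where f(Xi) = β0 + β1Xi1 + … + βnXin; the first term is the hinge loss and the second is a regularization term. A tiny hedged sketch of the hinge loss in Python:

import numpy as np

# Hinge loss for labels y in {-1, +1} and raw scores f(X)
def hinge_loss(y, scores):
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

print(hinge_loss(np.array([1, -1, 1]), np.array([2.0, -0.5, 0.3])))   # ≈ 0.4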

Hinge Loss figure

Too much theory for this blog. Let’s see SVM in real action now.

Implementation

Here we are going to implement the support vector classifier with a linear kernel function on the famous Iris dataset, which has three flower classes: Setosa, Versicolor, and Virginica. We want to build a model that takes petal length, petal width, sepal length, and sepal width as input and classifies the samples into the three flower classes.

Iris dataset svm

Step 1: Importing the necessary libraries

Here we import the basic libraries we need: NumPy, Pandas, and Matplotlib.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2: Load the dataset

Here we load the dataset, which is openly available and comes with the sklearn library. After loading the data, we can perform fundamental data analysis by plotting a scatter matrix of the different input features, colored by the target variable. It can be observed that there are three flower classes, represented as [0, 1, 2].

from sklearn import datasets
iris = datasets.load_iris()            # loading the dataset
X = iris.data                          # input features
y = iris.target                        # target labels
iris_dataframe = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                              columns=iris['feature_names'] + ['target'])
grr = pd.plotting.scatter_matrix(iris_dataframe, c=iris['target'],
                                 figsize=(15, 5),
                                 s=60, alpha=0.8)
plt.show()

Data plot

Step 3: Split the dataset and standardize it.

Scaling is always preferable before building an SVM-based classification model, because the decision boundary depends on distances to the support vectors, and features on larger scales would otherwise dominate.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

### Standardize the data
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

Step 4: Building the Model

This is the most crucial step, where we train our classifier. If we look at the signature of SVC in the sklearn library,

sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False,
                tol=0.001, cache_size=200, class_weight=None, verbose=False, random_state=None)

there are some important parameters that need to be tuned to build this model,

  • C (Regularization parameter): This controls how hard or soft the margin of our SVC model is. A larger C penalizes margin violations more, pushing towards a harder margin, while a smaller C gives a softer margin.
  • Kernel: We can design our own kernel. Some of the famous ones are rbf (default), poly, and linear.
  • gamma (γ): This controls how much each training sample influences the decision boundary (used by the rbf, poly, and sigmoid kernels).
  • degree (d): The degree of the polynomial in the polynomial kernel function shown above.
  • probability: The default value is False. If we set it to True, the model estimates a probability for every class, and we can pick the class with the highest probability.
  • shrinking: The default value is True; it decides whether to use the shrinking heuristic, an optimization technique inside the SVM solver. It is generally advised to keep it True for performance benefits.
  • random_state: The default value is None. It controls the pseudo-random number generation used for shuffling the data for probability estimates. When probability = False, random_state is ignored.
#Create the SVM model
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train_std, y_train)      # train on the standardized features
#Make the prediction
y_pred = classifier.predict(X_test_std)

Step 5: Evaluation of Built Model

Now we can evaluate the model by using the confusion matrix.

from sklearn.metrics import confusion_matrix, accuracy_score
from mlxtend.plotting import plot_confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(accuracy_score(y_test, y_pred))
fig, ax = plot_confusion_matrix(conf_mat=cm, figsize=(6, 6), cmap=plt.cm.Greens)
plt.xlabel('Predictions', fontsize=15)
plt.ylabel('Actuals', fontsize=15)
plt.title('Confusion Matrix', fontsize=15)
plt.show()

Confusion matrix

The accuracy on the test data comes out to be around 97.3%.

Possible Interview Questions

SVM is one of the most frequently asked topics in machine learning and data science interviews. If we explore platforms like Glassdoor and Quora for questions asked in interviews at bigger tech firms like Microsoft, we will find that SVM-related questions come up very often. Possible questions on this topic include,

  1. Explain the loss/cost function of the Support Vector Machine.
  2. What are “support vectors” in Support Vector Machine?
  3. What’s the role of Regularization parameter C in SVM?
  4. What is the role of gamma (γ) in SVM?
  5. What are Kernels in SVM?
  6. What is a Margin and what’s the difference between Soft and Hard margins?

Conclusion

In this article, we discussed an out-of-the-box classifier in machine learning: the Support Vector Machine. We learned about hyperplanes, the maximal margin classifier, the support vector classifier, and support vector machines. We also learned about different kernel functions and then implemented an SVM model on the famous Iris dataset. We hope you enjoyed the article.

Enjoy Learning! Enjoy Algorithms!
