K-Nearest Neighbor (KNN) is a supervised machine learning algorithm that can be used for both classification and regression tasks. It is one of the oldest machine learning algorithms and is still widely used today due to its simplicity. KNN is unique in that it does not explicitly map input variables to target variables during the learning process. In this article, we will dive into the details of how KNN works.
In this article, we will cover how the KNN algorithm works, how the value of K affects its predictions, why feature scaling matters, Voronoi cells and diagrams, KNN for regression, the strengths and limitations of the algorithm, and a Python implementation on the Iris dataset.
So let’s start without any further delay.
As we mentioned in the introduction, KNN is an algorithm that does not rely on traditional methods of learning parameters from the training data and fitting a function. Instead, it simply memorizes the training data and uses this information to classify new test samples based on their similarity to the learned training samples. Essentially, KNN can be thought of as an algorithm that relies on memorization rather than a more traditional approach to machine learning.
Instance-based learning, also known as memory-based learning, refers to the fact that the KNN algorithm compares new data samples to training data samples stored in its memory rather than using explicit generalization.
KNN is sometimes referred to as a "lazy" algorithm because it only performs computation when it receives new observations. This means that KNN simply stores all of the training data in its memory and defers calculations until it is given a new test sample to classify.
KNN is classified as a non-parametric algorithm because it does not assume a fixed functional form or learn a fixed set of parameters from the data. Instead, the model's complexity grows with the training set: if more instances are introduced in the future, the resulting decision boundaries can change significantly. This flexibility is a characteristic of non-parametric algorithms.
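This lazy behaviour is easy to observe in code. The sketch below is a rough illustration on randomly generated placeholder data: it times fit against predict with scikit-learn's KNeighborsClassifier. Fitting mostly amounts to storing and indexing the data, so the bulk of the distance work shows up at prediction time.
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((20000, 20))            # placeholder training data
y = rng.integers(0, 2, 20000)          # placeholder binary labels

knn = KNeighborsClassifier(n_neighbors=5)
t0 = time.time(); knn.fit(X, y); t1 = time.time()          # fit: store/index the data
knn.predict(X[:5000]); t2 = time.time()                     # predict: compute the distances
print('fit: %.3fs, predict: %.3fs' % (t1 - t0, t2 - t1))    # prediction typically dominates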
The KNN algorithm makes two main assumptions:
1. Samples that lie close to each other in the feature space are similar, so a new sample is likely to share the label of its nearest neighbors.
2. Every feature contributes meaningfully and comparably to the distance computation, which is why the features should be on comparable scales.
To understand how the KNN algorithm works, let's consider the steps involved in using KNN for classification:
Step 1: We first need to select the number of neighbors we want to consider. This is the term K in the KNN algorithm and highly affects the prediction.
Step 2: We need to find the K neighbors based on a distance metric. It can be Euclidean, Manhattan, or a custom distance metric. Given the test sample for which we want a prediction, the K closest training samples to this test sample are our K neighbors.
Step 3: Among the selected K neighbors, we count how many belong to each class.
Step 4: Now, we have to assign the test data sample to the class for which the neighbors count was maximum.
We performed the prediction in these four simple steps. In summary, the KNN algorithm stores the dataset at the training phase, and when it gets a new query, it classifies that query based on the classes of the most similar stored training samples, as the sketch below shows.
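The four steps above fit into a few lines of code. This is a minimal from-scratch illustration using NumPy and Euclidean distance; the arrays X_train, y_train, and the query point are made-up placeholders.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 1: K is chosen by the caller
    # Step 2: compute the Euclidean distance from the query to every training sample
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]        # indices of the K closest samples
    # Step 3: count how many of the K neighbors belong to each class
    votes = Counter(y_train[nearest])
    # Step 4: assign the class with the maximum count
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))   # predicts class 0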
Consider an example shown in the above image. The entire training dataset is initially considered and mapped in an R² space of positive and negative classes. The test case xq is then classified using 1-NN (1 neighbor) and 4-NN (4 neighbors) classifiers. The results for both are different, as we see that xq is classified as +ve for 1-NN and -ve for 4-NN.
In the KNN algorithm, the value of K can range from 1 to the total number of samples. A small value of K means that the model is prone to overfitting and vulnerable to outliers. This model will have high variance and low bias. On the other hand, a model with a high value of K will have low variance and high bias, resulting in underfitting. As the value of K is slowly increased from 1 to the number of training samples, the model will smooth out the boundary surfaces.
K = 1: A model with K = 1 will have zero training error and hard decision boundaries when classifying a test query.
K = number of training samples: This model will always lean towards the majority class (the class with the most samples) and be less accurate.
Note: Keeping the value of K odd is advisable to reduce the chances of a tie between classes.
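In practice, a common way to choose K is cross-validation. The snippet below is a sketch of that idea using scikit-learn's cross_val_score on the Iris data; the range of odd K values tried is an arbitrary choice.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in range(1, 31, 2):                               # odd values of K only
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean() # 5-fold cross-validation accuracy
best_k = max(scores, key=scores.get)
print('Best K:', best_k, 'with accuracy %.3f' % scores[best_k])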
KNN depends highly on the distance between data samples; hence scaling plays a vital role here. Suppose we train the KNN algorithm on unscaled data. If different attributes lie on very different scales, the distance will be dominated by the features with larger magnitudes, making our model effectively ignore the features with smaller values. To avoid that, it is always advisable to standardize the attributes before applying the KNN algorithm. For a better understanding, please look at this blog to visualize how distance calculation can be affected by scaling.
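A tiny made-up example illustrates the point: with one feature in years (age) and another in dollars (income), the raw Euclidean distance is driven almost entirely by income until both columns are standardized.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 50000],     # age (years), income (dollars)
              [26, 90000],
              [60, 52000]], dtype=float)

# raw distances: the income column dominates, so sample 0 looks closest to sample 2
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))

X_std = StandardScaler().fit_transform(X)   # each column now has mean 0 and std 1
# standardized distances: age and income now contribute comparably
print(np.linalg.norm(X_std[0] - X_std[1]), np.linalg.norm(X_std[0] - X_std[2]))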
Other ML algorithms like linear regression, logistic regression, and SVMs try to fit a mapping function from input to output. This mapping function is also known as the hypothesis function. KNN is different: it does not form any explicit hypothesis function but instead partitions the space using the training samples. For a dataset in R², this partition consists of polygons built around the training samples. Let's first understand what a Voronoi cell is.
Suppose the training set is T, and its elements are xi. Then the Voronoi cell of xi is a polytope (a geometric shape with "flat" sides) consisting of all points that are closer to xi than to any other point in T.
Observe in the above image that initially every cell contains a single sample, which corresponds to K = 1; as we increase the value of K, neighboring cells merge and form a new polytope containing K samples. The Voronoi cells cover the entire training space of T, and combining them creates the Voronoi diagram.
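For intuition, SciPy can draw these cells directly. The sketch below, on a handful of random placeholder points in R², uses scipy.spatial.Voronoi; each cell is exactly the region that a 1-NN classifier would assign to that training sample.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

rng = np.random.default_rng(0)
points = rng.random((15, 2))     # 15 training samples in R^2
vor = Voronoi(points)            # one Voronoi cell per training sample
voronoi_plot_2d(vor)
plt.title('Voronoi cells of the training samples (1-NN regions)')
plt.show()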
So far, we have discussed how we could use the KNN algorithm to solve the classification tasks, but this machine learning algorithm can also solve regression problems. We need to tweak the approach slightly. Instead of counting the K nearest neighbor class labels, what if we average the data over K neighbors? Yes! It will act as the regression model in such a scenario.
For example, let's say we have a test sample X for which we want to predict the continuous variable Y. Suppose we have decided to use three neighbors (i.e., K = 3). The three neighbors from the training data are:
X1 → Y1, X2 → Y2, X3 → Y3. We should be clear that KNN is a supervised learning algorithm; hence, we will always have the corresponding labels for the input variables while training. At prediction time, we can average the three labels to find the corresponding label of the test sample: Y = (Y1 + Y2 + Y3)/3. This averaging can be replaced with techniques like the median, the mode, or a custom approach.
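Scikit-learn ships this behaviour as KNeighborsRegressor. The sketch below uses made-up one-dimensional data just to show that the prediction is the average of the K nearest labels.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])   # made-up inputs
y_train = np.array([1.1, 1.9, 3.2, 10.5, 11.0])             # corresponding labels

knn_reg = KNeighborsRegressor(n_neighbors=3)    # K = 3, prediction = mean of the 3 nearest labels
knn_reg.fit(X_train, y_train)
print(knn_reg.predict([[2.5]]))                 # (1.1 + 1.9 + 3.2) / 3 ≈ 2.07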
The KNN algorithm is well-known for its simplicity and has several key strengths, including:
1. It is easy to understand and implement, and its predictions are easy to explain.
2. There is no explicit training phase; the model simply stores the training data.
3. The same algorithm supports both classification and regression tasks.
4. It adapts naturally as new training samples are added, since no parameters need to be re-fitted.
No doubt, KNN is cool, but this algorithm has some limitations. It is not the first choice among Machine Learning experts, and the reasons are:
1. Prediction is slow, as all the distance computations are deferred until a new query arrives.
2. The entire training set must be kept in memory, which becomes expensive for large datasets.
3. It is highly sensitive to feature scaling and to irrelevant or noisy features.
4. Performance degrades in high-dimensional spaces, and small values of K are vulnerable to outliers.
Let's implement the KNN algorithm in Python to solve a classification problem.
The dataset used to implement KNN is the famous Iris dataset, imported from the Scikit-learn datasets module as load_iris. Other libraries are imported for training, preprocessing, and evaluation.
import matplotlib.pyplot as plt  # plot the data
from sklearn import datasets  # read the data
import numpy as np  # for arrays
from sklearn.model_selection import train_test_split  # split the data
from sklearn.preprocessing import StandardScaler  # scale the data
from sklearn.neighbors import KNeighborsClassifier  # the algorithm
from sklearn.metrics import accuracy_score  # grade the results
import pandas as pd

iris = datasets.load_iris()  # read the data
X = iris.data[:]  # select the features to use
y = iris.target  # select the classes

iris_dataframe = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                              columns=iris['feature_names'] + ['target'])
grr = pd.plotting.scatter_matrix(iris_dataframe,
                                 c=iris["target"],
                                 figsize=(15, 15),
                                 marker='o',
                                 s=60,
                                 alpha=.8)
plt.show()
This dataset contains four variables: sepal length, sepal width, petal length, and petal width, which describe iris plants of three types: Setosa, Versicolour, and Virginica. There are 150 observations in the dataset, each labeled with the actual type of plant.
The Iris dataset, which has four dimensions, can be visualized using a pairwise scatter plot matrix to distinguish between the dimensions. This plot helps to visualize the relationships among multiple variables within subdivisions of the dataset. In the image below, the violet color represents the Setosa class, green represents the Versicolour class, and yellow represents the Virginica class.
The entire dataset is initially split into training and testing parts using the train_test_split function of Scikit-learn. In the next step, a standard scaler, StandardScaler(), is used to standardize the data (column-wise). When fit to a dataset, it will transform each feature to mean μ = 0 and standard deviation σ = 1.
For a dataset having N samples and m features, the scaler computes, for every feature column j, the mean μj and the standard deviation σj over the N samples.
Thus every value is then updated as x'ij = (xij − μj) / σj.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler()  # create the standard scaler
sc.fit(X_train)  # fit to the training data
X_train_std = sc.transform(X_train)  # transform the training data
X_test_std = sc.transform(X_test)  # same transformation on the test data
We will fit the KNN model for different K values ranging from 1 to the number of samples in the testing dataset. The metric 'minkowski' along with p = 2 corresponds to the Euclidean distance in the feature space. The model will be fitted for each value of K and then used to predict the outputs for the test samples.
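As a quick sanity check with SciPy's distance functions, the Minkowski distance with p = 2 coincides with the Euclidean distance, so the settings above reproduce the familiar metric:
from scipy.spatial.distance import euclidean, minkowski

u, v = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(u, v, p=2))   # 5.0
print(euclidean(u, v))        # 5.0
With the metric confirmed, the sweep over K on the standardized Iris features looks like this: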
accuracyTest = {}
accuracyTrain = {}
for k in range(len(y_test)):
    knn = KNeighborsClassifier(n_neighbors=k + 1, p=2, metric='minkowski')
    knn.fit(X_train_std, y_train)
    y_pred = knn.predict(X_test_std)
    y_train_pred = knn.predict(X_train_std)
    if (k + 1) % 10 == 0:
        print(10 * '-')
        print("For k = %s" % (k + 1))
        print('Number in test ', len(y_test))
        print('Misclassified samples: %d' % (y_test != y_pred).sum())
    accTrain = accuracy_score(y_train, y_train_pred)
    acc = accuracy_score(y_test, y_pred)
    accuracyTest[k + 1] = acc
    accuracyTrain[k + 1] = accTrain

for accuracy in [accuracyTrain, accuracyTest]:
    lists = sorted(accuracy.items())  # sorted by key, returns a list of tuples
    ks, accs = zip(*lists)  # unpack a list of pairs into two tuples
    plt.plot(ks, accs)
plt.show()
If we prioritize the testing accuracy, values of K greater than 18 decrease it sharply; the optimal number of neighbors lies somewhere around 15 to 18.
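Rather than reading the optimum off the plot, the same information can be pulled from the accuracy dictionary built in the loop above (a small follow-up sketch):
best_k = max(accuracyTest, key=accuracyTest.get)   # K with the highest test accuracy
print('Best K = %d, test accuracy = %.2f' % (best_k, accuracyTest[best_k]))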
The training and testing datasets are combined to show the effect of varying K in the KNN algorithm. Only two features (petal length and petal width) are considered for visualization. The values of K taken are [1, 25, 50, 75, 100, 112], where the training sample size is 112. At K = 112, the decision boundary simply predicts the overall majority class everywhere, which is shown in red.
X = iris.data[:, [2, 3]]  # select the features to use: petal length and petal width
y = iris.target  # select the classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler()  # create the standard scaler
sc.fit(X_train)  # fit to the training data
X_train_std = sc.transform(X_train)  # transform the training data
X_test_std = sc.transform(X_test)  # same transformation on the test data
X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))
print('Number in combined ', len(y_combined))

# check results on combined data
# (refit on the two selected features first; K = 15 is near the optimum found above)
knn = KNeighborsClassifier(n_neighbors=15, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)
y_combined_pred = knn.predict(X_combined_std)
print('Misclassified combined samples: %d' % (y_combined != y_combined_pred).sum())
print('Combined Accuracy: %.2f' % accuracy_score(y_combined, y_combined_pred))

# visualize the results
for k in [1, 25, 50, 75, 100, len(X_train)]:
    knn = KNeighborsClassifier(n_neighbors=k, p=2, metric='minkowski')
    knn.fit(X_train_std, y_train)
    # plot_decision_regions is a helper plotting function (not part of scikit-learn)
    plot_decision_regions(X=X_combined_std, y=y_combined, classifier=knn,
                          test_idx=range(105, 150))
    plt.xlabel('petal length [standardized]')
    plt.ylabel('petal width [standardized]')
    plt.title('k=%s' % k)
    plt.legend(loc='upper left')
    plt.show()
Although there are certain limitations, this algorithm is widely used in industries because of its simplicity, for example in recommendation systems, missing-value imputation, and pattern-recognition tasks such as handwritten digit classification.
The KNN algorithm is known for its explainability, which makes it a popular topic in interviews. Some potential questions that may be asked about KNN include:
1. Why is KNN called a lazy, non-parametric, instance-based algorithm?
2. How does the value of K affect the bias-variance trade-off, and how would you choose it?
3. Why is feature scaling important before applying KNN?
4. How can KNN be used for regression as well as classification?
5. Why is it advisable to keep K odd, and what happens when K equals the number of training samples?
In this article, we covered the concept of K-Nearest Neighbor (KNN), one of the first machine learning algorithms ever developed. We discussed how KNN defines instances as neighbors and how the value of K impacts the predictions. We also emphasized the importance of feature scaling and introduced the concept of Voronoi diagrams. We then explored the use of KNN for regression tasks and implemented the algorithm on the well-known Iris dataset. We hope you found this article informative and enjoyable.
Next Blog: Email Spam Classification Using KNN