k-Nearest Neighbours (k-NN): A detailed overview


k-Nearest Neighbour is a Supervised Machine Learning algorithm that can be used to solve both classification and regression problems. It is probably one of the earliest "machine learning" algorithms developed. The unique thing about this algorithm is that it learns without building an explicit mapping from the input variables to the target variables.

Key Takeaways from this blog

After going through this article, we will have a clear understanding of:

  1. What is the k-NN algorithm, and why does it fall under the family of instance-based learning?
  2. What are the common assumptions made in k-NN?
  3. The step-wise learning process of k-NN.
  4. How does the value of k affect the k-NN algorithm?
  5. What are Voronoi cells and Voronoi diagrams?
  6. How can k-NN be used to solve regression problems?
  7. Implementation of the k-NN algorithm on the Iris dataset to get real hands-on experience.

So let’s start without any further delay.

k-NN falls under the family of instance-based learning (also known as memory-based learning). Instead of performing explicit generalization, it compares new problem instances with the instances seen during training, which are stored in memory. It is also called a lazy algorithm, as computation only happens when a new observation arrives. k-NN is a non-parametric algorithm, as it makes no assumption about the underlying data distribution.

Assumptions used in k-NN

  • Every instance that is part of the training data is mapped to a point in the n-dimensional space R^n.
  • The nearest neighbours are defined in terms of a distance metric such as Euclidean distance, Manhattan distance, or Hamming distance. The choice of distance matters a lot and can change the prediction.

For two points x = (x₁, …, xₙ) and y = (y₁, …, yₙ), the common choices are:

  • Euclidean distance: d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
  • Manhattan distance: d(x, y) = Σᵢ |xᵢ − yᵢ|
  • Hamming distance: the number of positions i where xᵢ ≠ yᵢ
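
As a quick illustration (a minimal sketch assuming NumPy arrays of equal length, not part of the original snippet), these three distances can be computed as follows:

import numpy as np

def euclidean(x, y):
    # straight-line distance in R^n
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def hamming(x, y):
    # number of positions where the two vectors differ
    return np.sum(x != y)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(euclidean(x, y), manhattan(x, y), hamming(x, y))  # ~2.236, 3.0, 2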

Working of kNN

In the training phase, the k-NN algorithm simply stores the dataset. When it receives a new query, it assigns that query to the class that is most common among its k nearest stored instances.

k-NN in action

Fig. 1: 1-NN and 4-NN

Consider the example in Fig. 1. The entire training dataset is mapped into an R² space of positive and negative classes. The test query xq is then classified using 1-NN and 4-NN classifiers. The results differ: xq is classified as +ve by 1-NN and as -ve by 4-NN. A small from-scratch sketch of this procedure is shown below.
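
Here is a minimal from-scratch sketch (illustrative only, not the library implementation, with made-up toy points) that classifies a query by majority vote among its k nearest training points under Euclidean distance:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=4):
    # distances from the query to every stored training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [2.0, 2.0], [2.5, 2.5], [5.0, 5.0], [6.0, 6.0]])
y_train = np.array(['+', '+', '-', '-', '-'])
query = np.array([2.4, 2.4])
print(knn_predict(X_train, y_train, query, k=1))  # '-' (nearest point is the '-' outlier at [2.5, 2.5])
print(knn_predict(X_train, y_train, query, k=3))  # '+' (majority of the 3 nearest points)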

Effect of k in k-NN

The value of k in the k-NN algorithm can be chosen from the range [1, len(sample data)]. A small value of k makes the model prone to overfitting and vulnerable to outliers: such a model has high variance and low bias. On the other hand, a model with a high value of k has low variance and high bias and tends to underfit. As k is increased from 1 towards the number of training samples, the decision boundary becomes smoother.

k = 1: A model with k = 1 will have zero training error and sharp boundaries for determining the class of a test query.

k = len(sample data): This model will be highly biased towards the majority class (the one with the most samples) and lose accuracy.

Note: It is advisable to keep k odd to reduce the chances of a tie.
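One common way to pick k in practice (shown here only as a sketch using scikit-learn's cross-validation utilities on the Iris data used later in this article) is to evaluate a range of odd values and keep the best:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 31, 2):  # odd values of k to reduce ties
    knn = KNeighborsClassifier(n_neighbors=k)
    # mean 5-fold cross-validated accuracy for this k
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k, "cv accuracy: %.3f" % scores[best_k])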


Feature Scaling could affect the prediction

If we train the k-NN algorithm on unscaled data, different attributes may lie on very different scales. Since the distance calculation treats all attributes equally, attributes with larger scales dominate the distance and make the model biased towards them. To avoid this, it is advisable to standardize the attributes before applying the k-NN algorithm.

Figure: Importance of feature scaling (Source: Scikit-learn.org)
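
A minimal sketch (using the Iris data introduced later in this article) of wiring the scaler and the classifier together with scikit-learn's Pipeline, so the same scaling fitted on the training data is applied at prediction time:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# the pipeline fits the scaler on the training data only,
# then applies the same transformation when predicting
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("test accuracy: %.2f" % model.score(X_test, y_test))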

Hypothesis Space (Voronoi Diagram) for kNN

The k-NN algorithm does not form any explicit hypothesis function (a mapping function that the machine learns). For a dataset in R², the hypothesis space can instead be pictured as a partition of the plane into polygons built around the training samples.

Voronoi Cell: Suppose the training set is T and x is an element of T. The Voronoi cell of x is a polytope (a geometric shape with "flat" sides) consisting of all points that are closer to x than to any other point in T.

Voronoi Cell

Voronoi cells cover the entire training space of T, and the union of these cells forms the Voronoi diagram.

Voronoi Diagram

Steps to form the Voronoi Diagram:

  1. Examine the region where you think decision boundaries should occur. These are regions between two opposite classes.
  2. Identify pairs of oppositely labeled (+/-) classes and draw perpendicular bisectors of these oppositely labeled points.
  3. Extend all lines till they join and then draw the decision boundary.

Consider the example below, where two different classes labeled (+/-) are represented in the two-dimensional coordinate space. Points are:

Data Points

Step 1: First, the points are plotted in the graph.

Voronoi Diagram implementation

Step 2: Perpendicular bisectors between oppositely labeled points are drawn.

(A,C), (B,C), (D,C), (F,C), (B,E), (F,E)

Step 3: The boundaries are extended to form the Voronoi diagram. As the Voronoi diagram represents 1-NN, the decision boundaries are sharp. Increasing k from 1 to higher values will smoothen the decision boundaries.
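
For reference, a Voronoi diagram over a set of 2-D training points can also be computed and plotted programmatically. A minimal sketch using SciPy (the point coordinates below are made up for illustration, not the exact points from the figure):

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

# illustrative 2-D training points
points = np.array([[1, 1], [2, 4], [3, 2], [5, 5], [6, 1], [4, 3]])

vor = Voronoi(points)   # compute the Voronoi cells of the points
voronoi_plot_2d(vor)    # draw the cell boundaries and the points
plt.title("Voronoi diagram (1-NN decision regions)")
plt.show()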

k-NN for Regression tasks

So far, we have discussed how the k-NN algorithm can be used to solve classification tasks, but this machine learning algorithm can also solve regression problems. In that case, we only need to tweak the approach: instead of taking a majority vote over the k nearest class labels, what if we average the target values over the k neighbours? Yes! It then acts as a regression model, as sketched below.

k-NN Regression
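
A minimal sketch of this idea with scikit-learn's KNeighborsRegressor (the toy 1-D data below is made up for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy 1-D regression data: y is roughly 2*x with a little noise
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0.1, 2.2, 3.9, 6.1, 8.0, 9.8])

reg = KNeighborsRegressor(n_neighbors=3)  # prediction = mean of the 3 nearest targets
reg.fit(X, y)
print(reg.predict([[2.6]]))  # mean of the targets for x = 2, 3, and 4, i.e. 6.0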

Strengths of the k-NN algorithm

  1. Nearly zero training time: training only stores the data, so very little training time is required compared to other machine learning algorithms.
  2. Sample efficiency: there is no need for a very large training sample.
  3. Explainable: at each step, the reason for the prediction (the neighbours that voted for it) can easily be shown. Such explainability is quite rare.
  4. Easy to add and remove data: for other machine learning models, adding data requires retraining the model, while in k-NN we can directly update the memory and perform inference.
  5. Less sensitive to class imbalance: suppose we have 2 classes and one class has many more instances in the dataset than the other. k-NN, unlike many other ML algorithms, is comparatively less affected by such class imbalance.

Disadvantages of k-NN algorithm

  1. Needs a lot of storage: k-NN stores the whole training data in its memory and performs inference based on it.
  2. Predictions are slow: the time complexity of a naive k-NN prediction is O(dN), where d is the dimension and N is the total number of samples. The more data, the longer the prediction time.
  3. Sensitive to irrelevant features: the nearest neighbours can be fooled by irrelevant features, since every attribute contributes to the distance.

kNN Implementation in Python using sklearn

Importing the necessary libraries, loading the dataset, and visualizing it.

The dataset used to implement k-NN is the Iris dataset, imported from sklearn.datasets via load_iris. The Iris dataset is a standard dataset that is used for testing many algorithms. On the input side, it has four features, while the output label is one of three classes. The other libraries are imported for training, preprocessing, and evaluation.

import matplotlib.pyplot as plt                       # plotting
from sklearn import datasets                          # load the data
import numpy as np                                    # arrays
from sklearn.model_selection import train_test_split  # split the data
from sklearn.preprocessing import StandardScaler      # scale the data
from sklearn.neighbors import KNeighborsClassifier    # the algorithm
from sklearn.metrics import accuracy_score            # grade the results
import pandas as pd                                   # dataframes

iris = datasets.load_iris() # read the data 

X = iris.data[:]  # select the features to use 
y = iris.target   # select the classes


iris_dataframe = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                              columns=iris['feature_names'] + ['target'])

grr = pd.plotting.scatter_matrix(iris_dataframe,
                                 c=iris["target"],
                                 figsize=(15, 15),
                                 marker='o',
                                 s=60,       # marker size (lowercase 's')
                                 alpha=.8)
plt.show()

This dataset has four variable measurements (sepal length, sepal width, petal length, and petal width) describing iris plants of three types: Setosa, Versicolour, and Virginica. The dataset contains 150 observations (samples), each labeled with the actual type of the plant. Since the dataset has four dimensions, it is visualized pairwise to distinguish the classes. The pairwise scatter-plot matrix of the Iris dataset helps visualize the relationships among multiple variables within subdivisions of the dataset. In the visualization, violet represents Setosa, green represents Versicolour, and yellow represents Virginica.

Pairwise comparison for different features

Data Preprocessing

The entire dataset is initially split into training and testing parts using train_test_split. In the next step, a standard scaler, StandardScaler(), is used to standardize the data column-wise. When fit to a dataset, it transforms every feature to mean μ = 0 and standard deviation σ = 1.

For a dataset with N samples and m features, the mean and standard deviation of feature j are

μⱼ = (1/N) Σᵢ xᵢⱼ  and  σⱼ = √( (1/N) Σᵢ (xᵢⱼ − μⱼ)² )

Every value is then updated as

x′ᵢⱼ = (xᵢⱼ − μⱼ) / σⱼ

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


sc = StandardScaler()                # create the standard scaler
sc.fit(X_train)                      # fit to the training data
X_train_std = sc.transform(X_train)  # transform the training data
X_test_std = sc.transform(X_test)    # same transformation on the test data

Model Fitting and Evaluation

The model is fitted with k-NN for different values of k, ranging from 1 to the number of samples in the testing dataset. The metric 'minkowski' together with p = 2 corresponds to the Euclidean distance. The model is fitted for each value of k and then used to predict the output for the test samples.

accuracyTest = {}
accuracyTrain = {}

for k in range(len(y_test)):

    knn = KNeighborsClassifier(n_neighbors=k+1, p=2, metric='minkowski')
    knn.fit(X_train_std, y_train)
    y_pred = knn.predict(X_test_std)
    y_train_pred = knn.predict(X_train_std)

    if (k+1) % 10 == 0:
        print(10*'-')
        print("For k = %s" % (k+1))
        print('Number in test ', len(y_test))
        print('Misclassified samples: %d' % (y_test != y_pred).sum())

    accTrain = accuracy_score(y_train, y_train_pred)
    acc = accuracy_score(y_test, y_pred)
    accuracyTest[k+1] = acc
    accuracyTrain[k+1] = accTrain

for accuracy in [accuracyTrain, accuracyTest]:
    lists = sorted(accuracy.items())  # sorted by key, returns a list of (k, accuracy) tuples
    ks, accs = zip(*lists)            # unpack the list of pairs into two tuples
    plt.plot(ks, accs)
plt.show()

Training and Testing accuracy

Decision Boundaries for k-NN

The two datasets (training and testing) are combined to show the effect of varying k in the k-NN algorithm. For visualization, only two features (petal length and petal width) are considered. The values of k used are [1, 25, 50, 75, 100, 112], where 112 is the training sample size. The decision boundary at k = 112 simply predicts the majority class of the three (red) everywhere.

X = iris.data[:, [2, 3]]  # use only petal length and petal width
y = iris.target           # select the classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


sc = StandardScaler()                # create the standard scaler
sc.fit(X_train)                      # fit to the training data
X_train_std = sc.transform(X_train)  # transform the training data
X_test_std = sc.transform(X_test)    # same transformation on the test data

X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))
print('Number in combined ', len(y_combined))

# visualize the results for different values of k
for k in [1, 25, 50, 75, 100, len(X_train)]:

    knn = KNeighborsClassifier(n_neighbors=k, p=2, metric='minkowski')
    knn.fit(X_train_std, y_train)

    # check results on the combined data
    y_combined_pred = knn.predict(X_combined_std)
    print('Misclassified combined samples: %d' % (y_combined != y_combined_pred).sum())
    print('Combined Accuracy: %.2f' % accuracy_score(y_combined, y_combined_pred))

    plot_decision_regions(X=X_combined_std, y=y_combined, classifier=knn,
                          test_idx=range(112, 150))  # the last 38 points are the test set
    plt.xlabel('petal length [standardized]')
    plt.ylabel('petal width [standardized]')
    plt.title('k=%s' % k)
    plt.legend(loc='upper left')
    plt.show()

Decision boundary
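
Note that plot_decision_regions is not a scikit-learn function; it has to be defined or imported separately (for example, the mlxtend library provides a similar utility). A minimal helper in the spirit of the one popularized by S. Raschka's "Python Machine Learning" could look roughly like this (an illustrative sketch, not the exact function used above):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # evaluate the classifier on a dense grid covering the feature space
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)

    # overlay the actual samples, one marker style per class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(X[y == cl, 0], X[y == cl, 1],
                    alpha=0.8, c=colors[idx],
                    marker=markers[idx], label=cl)

    # circle the test samples if their indices are given
    if test_idx is not None:
        plt.scatter(X[test_idx, 0], X[test_idx, 1],
                    facecolors='none', edgecolors='black',
                    s=100, label='test set')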

Possible Interview Questions

As we stated, this algorithm brings a lot of explainability with itself. Interviewers can ask deeper questions on this topic. Some of them could be,

  1. How is k-NN different from other Machine Learning algorithms?
  2. Will changing the distance metric affect the classification accuracy?
  3. Is k-NN highly sensitive to data normalization?
  4. Why is it a non-parametric algorithm?
  5. What are the major cons of the k-NN algorithm?

Conclusion

In this article, we have covered the concept of one of the earliest "Machine Learning" algorithms, k-Nearest Neighbour. We saw how instances are defined as neighbours and how the value of k affects the predictions. We also discussed why feature scaling plays an important role and learned about the Voronoi diagram, which describes the decision regions k-NN forms. After that, we discussed the regression use-case of k-NN and the advantages and disadvantages it brings. Finally, we implemented the k-NN algorithm on the famous Iris dataset. We hope you enjoyed the article.

References

  1. Scikit-learn: Machine Learning in Python, Pedregosa, et al., JMLR 12, pp. 2825–2830, 2011
  2. Mitchell, T. M. (1997). Machine Learning. McGraw-Hill series in computer science. New York: McGraw-Hill.
  3. UCI Machine Learning Repository: Iris Data Set.
  4. J. D. Hunter, “Matplotlib: A 2D Graphics Environment”, Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.

Enjoy Learning! Enjoy Algorithms!
