**k-Nearest Neighbor** is a supervised machine learning algorithm that can solve both classification and regression problems. It is one of the earliest "machine learning" algorithms. The unique thing about it is that it learns without explicitly mapping the input variables to the target variables.

After going through this article, we will be familiar with:

- What is the k-NN algorithm, and why does it fall under the family of instance-based learning?
- What common assumptions does it make?
- The step-wise learning process of k-NN.
- How does the value of k affect the k-NN algorithm?
- What are Voronoi cells and Voronoi diagrams?
- How can k-NN be used to solve regression problems?
- Implementation of the k-NN algorithm on the Iris dataset for real hands-on experience.

So let’s start without any further delay.

k-NN falls under the family of **instance-based learning** (also known as memory-based learning). Instead of performing explicit generalization, it compares new problem instances with instances seen during training, which are kept in stored memory. Such algorithms are also called lazy algorithms, as computation happens only when a new observation arrives. k-NN is a **non-parametric algorithm**, as it makes no assumptions about the underlying data distribution.

- Every instance in the training data is mapped to a point in the n-dimensional space R^n.
- The nearest neighbors are defined in terms of **Euclidean distance**, **Manhattan distance**, or **Hamming distance**. The choice of distance metric matters a lot and can change the prediction.
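To see how the choice of metric can change the prediction, here is a minimal sketch with hypothetical points where the nearest neighbor under Euclidean distance differs from the nearest neighbor under Manhattan distance:

```python
import numpy as np

def euclidean(a, b):
    # straight-line distance between two points
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    # sum of absolute coordinate differences
    return float(np.sum(np.abs(a - b)))

q  = np.array([0.0, 0.0])   # query point
p1 = np.array([3.0, 3.0])   # Euclidean dist ~4.24, Manhattan dist 6.0
p2 = np.array([0.0, 4.5])   # Euclidean dist 4.5,   Manhattan dist 4.5

nearest_euclidean = min((p1, p2), key=lambda p: euclidean(q, p))
nearest_manhattan = min((p1, p2), key=lambda p: manhattan(q, p))
# p1 is nearer under Euclidean distance, while p2 is nearer under Manhattan
```

With k = 1, these two metrics would therefore assign the query to different neighbors, and potentially to different classes.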

During the training phase, the k-NN algorithm simply stores the dataset; when it receives a new query, it assigns that query to the class that its most similar stored instances belong to.
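The store-then-vote idea can be sketched in a few lines of plain NumPy. This is a minimal, illustrative implementation (the toy data below is made up to mirror the 1-NN vs 4-NN situation discussed next), not the library version used later in the article:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k):
    # distance from the query to every stored training instance
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # majority vote among the k neighbours' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy data: one '+' very close to the query, three '-' a bit farther away
X = np.array([[0.1, 0.0], [1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array(['+', '-', '-', '-'])
q = np.array([0.0, 0.0])
```

With this data, `knn_predict(X, y, q, 1)` returns `'+'` while `knn_predict(X, y, q, 4)` returns `'-'`, exactly the kind of disagreement between 1-NN and 4-NN shown in Fig. 1.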

*Fig. 1: 1-NN and 4-NN*

Consider the example in Fig. 1. The entire training dataset is mapped into an R² space of positive and negative classes. The test case **xq** is then classified using both a 1-NN and a 4-NN classifier. The results differ: **xq is classified as +ve by 1-NN and -ve by 4-NN.**

The value of k in the k-NN algorithm can be chosen from the range **[1, len(sample data)]**. A small value of k means the model overfits and is vulnerable to outliers; such a model has high variance and low bias. On the other hand, a model with a high value of k has low variance and high bias, and underfits. As **k** increases from 1 to the number of training samples, the decision boundaries become progressively smoother.

**k = 1:** A model with k = 1 has zero training error and hard boundaries when determining the class of a test query.

**k = len(sample data):** This model is heavily biased towards the majority class (the one with the most samples) and loses accuracy.

**Note:** It is advisable to keep k odd to reduce the chance of a tie.
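The two extremes can be checked directly with scikit-learn. In this small made-up 1-D example, k = 1 tracks the nearest point, while k = len(data) always returns the majority class no matter where the query lies:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# hypothetical 1-D data: three samples of class 0, two of class 1
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y = np.array([0, 0, 0, 1, 1])

# k = 1: a query right next to a class-1 point is assigned class 1
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# k = len(X): every query sees all labels, so the majority class (0) always wins
knn_all = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)
```

Here `knn_1.predict([[10.5]])` gives class 1, but `knn_all.predict([[10.5]])` gives class 0 even though the query sits right between the two class-1 samples.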

If we train the k-NN algorithm on unscaled data, different attributes may lie on very different scales. The distance computation ignores those scales, so attributes with large numeric ranges dominate the distance and bias the model. To avoid this, it is always advisable to standardize the attributes before applying the k-NN algorithm.

*Image source: Scikit-learn.org*

The k-NN algorithm does not form an explicit hypothesis function (a mapping function that the machine learns). For a dataset in R², the hypothesis space is a collection of polygonal regions formed around the training samples.

**Voronoi Cell:** Suppose the training set is "T" and "x" is an element of that training set. Then the **Voronoi cell** of x is the set of all points in the space that are closer to x than to any other element of T.

Voronoi cells cover the entire training space of T, and the union of these cells forms the Voronoi diagram.

- Examine the regions where you expect decision boundaries to occur: these lie between oppositely labeled classes.
- Identify pairs of oppositely labeled (+/-) points and draw the perpendicular bisector of each pair.
- Extend all lines until they meet, then trace out the decision boundary.

Consider the example below, where points from two classes labeled (+/-) are placed in the two-dimensional coordinate plane.

**Step 1:** The points are plotted on the graph.

**Step 2:** Perpendicular bisectors are drawn between the oppositely labeled pairs: (A,C), (B,C), (D,C), (F,C), (B,E), (F,E).

**Step 3:** The bisectors are extended to form the Voronoi diagram. Since the Voronoi diagram represents 1-NN, the decision boundaries are sharp; increasing k to higher values smooths them.
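The same construction can be computed programmatically with SciPy. The coordinates below are hypothetical stand-ins for the six labelled points A–F from the example; the library handles the bisector geometry that the manual steps describe:

```python
import numpy as np
from scipy.spatial import Voronoi

# hypothetical coordinates for the six labelled points A-F
points = np.array([[0.0, 2.0], [2.0, 3.0], [1.0, 1.0],
                   [0.0, 0.0], [3.0, 1.0], [2.0, 0.0]])
vor = Voronoi(points)
# every input point owns exactly one Voronoi cell; the ridges separating
# cells of oppositely labelled points are the perpendicular bisector
# segments that make up the 1-NN decision boundary
```

`scipy.spatial.voronoi_plot_2d(vor)` can then render the diagram with matplotlib.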

So far, we have discussed how the k-NN algorithm solves classification tasks, but this machine learning algorithm can also solve regression problems. In that case, we only need to tweak the approach: instead of taking a majority vote over the **k nearest** class labels, we average the target values of the k nearest neighbors.
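scikit-learn exposes this averaging variant as `KNeighborsRegressor`. A minimal sketch on made-up 1-D data lying on the line y = 2x:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy 1-D data on the line y = 2x
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

reg = KNeighborsRegressor(n_neighbors=2).fit(X, y)
# for x = 2.5 the two nearest targets are 4 and 6, so the prediction
# is their mean
pred = reg.predict([[2.5]])[0]   # → 5.0
```

With uniform weights the prediction is simply the arithmetic mean of the k nearest targets; distance-weighted averaging is also available via `weights='distance'`.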

- **Zero training time:** Very little training time is required compared to other machine learning algorithms.
- **Sample efficiency:** A very large number of training samples is not needed.
- **Explainable:** The reason for each prediction can easily be traced back to specific neighbors. Such explainability is really rare.
- **Easy to add and remove data:** Other machine learning models require retraining when data is added, whereas in k-NN we can directly update the stored memory and perform inference.
- **Less sensitive to class imbalance:** Suppose one of two classes has far more instances in the dataset than the other. k-NN, unlike some other ML algorithms, is comparatively less affected by such imbalance.

- **Needs a lot of storage:** k-NN keeps the whole training set in memory and performs inference against it.
- **Predictions are slow:** The time complexity of a naive k-NN query is O(dN), where **d** is the dimension and **N** is the total number of samples; the more data, the longer the prediction takes.
- **Irrelevant features:** The nearest neighbors can be fooled by irrelevant features, since every dimension contributes equally to the distance.

The dataset used to implement k-NN is the Iris dataset, imported from sklearn.datasets as load_iris. The **Iris dataset** is a standard dataset used for testing many algorithms. It has four input features, and the output label is one of three classes. Other libraries are imported for training, preprocessing, and evaluation.

```
import matplotlib.pyplot as plt  # plotting
import numpy as np               # arrays
import pandas as pd              # dataframes
from sklearn import datasets                          # read the data
from sklearn.model_selection import train_test_split  # split the data
from sklearn.preprocessing import StandardScaler      # scale the data
from sklearn.neighbors import KNeighborsClassifier    # the algorithm
from sklearn.metrics import accuracy_score            # grade the results

iris = datasets.load_iris()  # read the data
X = iris.data[:]             # select the features to use
y = iris.target              # select the classes

iris_dataframe = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                              columns=iris['feature_names'] + ['target'])
grr = pd.plotting.scatter_matrix(iris_dataframe,
                                 c=iris['target'],
                                 figsize=(15, 15),
                                 marker='o',
                                 s=60,      # marker size (lowercase s)
                                 alpha=.8)
plt.show()
```

This dataset has four variable measurements — **sepal length, sepal width, petal length, and petal width** — describing iris plants of three types: *Setosa*, *Versicolour*, and *Virginica*.

The entire dataset is first split into training and testing parts using train_test_split. In the next step, StandardScaler() is used to standardize the data column-wise. When fit to a dataset, it transforms each column to **mean μ = 0** and **standard deviation σ = 1.**

For a dataset with **N** samples, each feature column is standardized using that column's own mean μ and standard deviation σ. Every value x is then updated as x′ = (x − μ) / σ.

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler()                # create the standard scaler
sc.fit(X_train)                      # fit to the training data
X_train_std = sc.transform(X_train)  # transform the training data
X_test_std = sc.transform(X_test)    # same transformation on the test data
```

The model is fitted with k-NN for values of k ranging from 1 to the number of samples in the testing dataset. The metric "**minkowski**" with p = 2 corresponds to the Euclidean distance. For each value of k, the model is fitted and then used to predict the outputs for the test set.

```
accuracyTest = {}
accuracyTrain = {}
for k in range(len(y_test)):
    knn = KNeighborsClassifier(n_neighbors=k+1, p=2, metric='minkowski')
    knn.fit(X_train_std, y_train)
    y_pred = knn.predict(X_test_std)
    y_train_pred = knn.predict(X_train_std)
    if (k+1) % 10 == 0:
        print(10*'-')
        print("For k = %s" % (k+1))
        print('Number in test ', len(y_test))
        print('Misclassified samples: %d' % (y_test != y_pred).sum())
    accuracyTrain[k+1] = accuracy_score(y_train, y_train_pred)
    accuracyTest[k+1] = accuracy_score(y_test, y_pred)

for accuracy in [accuracyTrain, accuracyTest]:
    lists = sorted(accuracy.items())  # sorted by key: a list of (k, accuracy) tuples
    ks, accs = zip(*lists)            # unpack the pairs into two tuples
    plt.plot(ks, accs)
plt.show()
```

The two datasets (training and testing) are combined to show the effect of varying k in the k-NN algorithm. For visualization, only two features (petal length and petal width) are used. The values of k taken are [1, 25, 50, 100, 112], where 112 is the training sample size. The decision boundary at k = 112 simply returns the majority of the three classes (red).

```
# plot_decision_regions is a helper for plotting classifier regions
# (e.g., the function from S. Raschka's "Python Machine Learning"),
# assumed to be defined or imported elsewhere

X = iris.data[:, [2, 3]]  # select only petal length and petal width
y = iris.target           # select the classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler()                # create the standard scaler
sc.fit(X_train)                      # fit to the training data
X_train_std = sc.transform(X_train)  # transform the training data
X_test_std = sc.transform(X_test)    # same transformation on the test data
X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))
print('Number in combined ', len(y_combined))

# visualize the results for different values of k
for k in [1, 25, 50, 100, len(X_train)]:
    knn = KNeighborsClassifier(n_neighbors=k, p=2, metric='minkowski')
    knn.fit(X_train_std, y_train)
    # check results on the combined data
    y_combined_pred = knn.predict(X_combined_std)
    print('Misclassified combined samples: %d' % (y_combined != y_combined_pred).sum())
    print('Combined Accuracy: %.2f' % accuracy_score(y_combined, y_combined_pred))
    plot_decision_regions(X=X_combined_std, y=y_combined, classifier=knn,
                          test_idx=range(105, 150))
    plt.xlabel('petal length [standardized]')
    plt.ylabel('petal width [standardized]')
    plt.title('k=%s' % k)
    plt.legend(loc='upper left')
    plt.show()
```

As we stated, this algorithm brings a lot of explainability with it. Interviewers can ask deeper questions on this topic, such as:

- How is k-NN different from other machine learning algorithms?
- Will changing the distance metric affect the classification accuracy?
- Is k-NN highly sensitive to data normalization?
- Why is it a non-parametric algorithm?
- What are the major cons of the k-NN algorithm?

In this article, we covered one of the earliest "machine learning" algorithms, k-Nearest Neighbour. We saw how instances are defined as neighbors and how the value of **k** affects predictions. We also discussed why feature scaling plays an important role, and learned about the Voronoi diagram that underlies the 1-NN decision boundary. After that, we discussed the regression use-case of k-NN and the advantages and disadvantages it brings. Finally, we implemented the k-NN algorithm on the famous Iris dataset. We hope you enjoyed the article.

- Pedregosa, F., *et al.*, "Scikit-learn: Machine Learning in Python", JMLR 12, pp. 2825–2830, 2011.
- Mitchell, T. M. (1997). *Machine Learning*. McGraw-Hill Series in Computer Science. New York: McGraw-Hill.
- UCI Machine Learning Repository: Iris Data Set.
- Hunter, J. D., "Matplotlib: A 2D Graphics Environment", Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.
