In today’s world, most data samples are unlabelled as it needs human resources and additional cost. To tackle this challenge, ML engineers developed unsupervised learning techniques to remove the dependency on data labeling. One of the methods in unsupervised learning techniques is clustering, where machines automatically form groups of data samples based on their similarities.

The K-means algorithm is one of the most widely used clustering algorithms in machine learning. It separates data samples into K distinct clusters, and we will learn the procedure through which it does the same.

After successfully going through this blog, we will be able to understand the following things:

- A basic introduction to clustering and how it works.
- Theory of K-means algorithm, step-wise implementation, limitations, and possible improvements.
- How to choose the value of K?
- Possible interview questions on the k-means algorithm.

Let’s start with clustering, as K-means is an algorithm to perform clustering.

Clustering is an unsupervised learning technique used to find the subgroups (also known as Clusters) in the dataset. Let’s understand this idea with an analogy — suppose you have a box of mixed candies. Some candies are salty, some are sweet, and some are chocolate.

Now, if someone asks you to give them chocolate candy, finding it in the mix would be difficult. But what if we have already separated the candies into different boxes based on flavor? This would make it much easier to find chocolate candy when someone asks for it.

Similarly, clustering is a technique in which we group similar data points in a dataset based on certain criteria, such as their shapes, size, taste, or color. The image below shows an example of clustering — the data samples have been divided into three clusters and assigned different colors to represent each cluster. Yellow squares correspond to the 1st cluster, blue to the 2nd, and red to the 3rd.

The clustering technique is prevalent in many fields, and many algorithms exist to perform it. K-means is one of them, which is the main topic of this blog. So let’s discuss this in greater detail.

K-means is a simple and elegant approach to partitioning data samples into pre-defined “K” distinct and non-overlapping clusters. The value of **K** in the K-means algorithm depends on the user’s choice. If we use the K-means algorithm to segregate data samples into three categories, we need to mention the value of k = 3 in our code.

James MacQueen first used the term “K-means” in 1967. It is an older technique but still very popular in the data science and machine learning industries.

**In the K-means algorithm, every data sample from the dataset will follow two fundamental properties:**

- Each data sample belongs to at least one of the ‘K’ clusters. In simple terms, there can’t be any sample that is not a part of any of the clusters.
- No data sample will belong to more than one cluster. In simple terms, one sample cannot be present in two (or more) clusters at the same time.

Because of its simplicity, relative robustness (it can work with a wider variety of datasets), and “good enough” explanation, it is still considered one of the first algorithms data analysts use to investigate any new dataset. Explainability is one of its key characteristics and makes it even more popular.

Suppose we have some data samples (X1, X2, …, Xn), and we want to divide these data samples into “K” clusters. We can use the K-means algorithm to perform this task based on iterative logic. The complete implementation can be summarized in five steps:

Accept data inputs and pre-define the number of groups we want to form. In our case, we have “n” data samples, and we want to divide these samples into K clusters.

K-means is an algorithm that inherently uses the statistical means of samples present in a cluster. But that would be possible when we have already formed the group. To start from somewhere, we need some values for these means, which will later be updated when the algorithm progresses.

Initialize the first K means for K clusters. There can be two ways to do this:

- Pick the first K samples, or
- Randomly pick K elements from all the available data samples.

These **K** samples will be treated as temporary centroids (also called the mean).

Now, we have **(n — k)** data samples left. For each sample, we need to calculate its distance from all the K centroids, and later it will be assigned to only one of the K clusters for which distance would be minimum. We can use any of the distance measures to calculate this distance. Some of the popular ones are shown in the image below.

**What will happen if the distances from the two clusters are equal and minimum? (Think!)** In that case, we can pick any one cluster and put that sample in that.

All the samples have been assigned to some clusters out of K clusters at this stage.

After all the data points have been assigned to a cluster, we need to calculate the new centroids for every cluster. But this time, the centroid calculation will not be random. The new centroid is calculated as the mean of all the data points assigned to that cluster. At this step, we will have **K** centroids that can differ from the earlier chosen random **K** samples.

Please note that we are not saying all the K centroids will differ. There can be a possibility that no other data sample was assigned to one cluster. In that case, the centroid for that cluster will not update.

We will repeat steps 3 and 4 until the entire **n** samples converge. By convergence, we mean that there will be no or negligible movement of samples among clusters.

In this way, K-means groups the data samples into the desired number of clusters. The steps are straightforward and intuitive, one of the most important reasons for the algorithm’s popularity.

```
data_samples = [x1, x2, x3, ..., xn]
initialize_k_means = [x1, x2, ..., xk]
for all (n-k) sample:
track_minimum_distance
for all K selected means:
calculate distance of samples from all the selected K means.
assign sample to the cluster with which distance is minimum.
for all K means:
calculate the updated mean values.
Repeat above loops until no update in centroid happens.
Output K clusters.
```

K-means is the most popular algorithm for clustering data points. But it has certain limitations:

- K-means algorithm is computationally expensive. It requires significant computational resources and time to process large amounts of data. Here the time required is proportional to the number of data items (n), the number of clusters (k), and the number of iterations (m).
- The quality of the resulting clusters highly depends on the initial selection of K clusters. The algorithm randomly picks K clusters in the first step, which can lead to suboptimal results if the initial clusters are not chosen carefully.
- There can be situations where the K-means algorithm suffers from an empty cluster problem. This can happen when the initial K centroids are selected randomly, and one of them happens to be an outlier. In such cases, the cluster associated with that centroid might become empty.

As per the limitations discussed above, there are scopes where K-means can be improved further. The K-means algorithm computes the distance between data samples and the centroid within each cluster in each iteration. This makes the algorithm computationally very expensive in the case of massive datasets. However, we can utilize the knowledge of the distance calculated in the previous iteration to make it computationally cheaper. **Can you think of how?**

In the first iteration, we calculated the distance for each data point to the nearest cluster. Let’s say it is **previous_distance[i]** for the **ith** sample. In the next iteration, the centroids of all the groups will change because samples will be changed in the clusters.

Here, we will first calculate the distance of the same **i-th** sample to the previous cluster (but with the updated centroid) and term it as **current_distance[i]**. We will then compare it with the **previous_distance[i]**.

There are two scenarios:

**current**There will be no movement of the ith sample within clusters as distance become less. We can interpret this situation as the updated centroid of the cluster coming closer to the ith sample, making it a more potential candidate for falling into that cluster.*distance[i] ≤ previous*distance[i]:**current**The sample will leave the previous cluster and move towards the other cluster.*distance[i] > previous*distance[i]:

By making this comparison at the start, we will save the time required to compute the distances to **K-1** clusters, making it slightly better in computational complexity.

K-means is an unsupervised learning algorithm; hence, we do not have any data labels. **How can we pre-decide the number of clusters?** The decision of the optimal number of clusters is subjective and depends on the methods used to find the similarity based on which we perform the clustering. Some algorithmic approaches can help us decide the value for K, but they only provide a possible number of clusters. It’s upon us how many clusters we want.

Let’s discuss the pseudocode we can use to implement the Elbow method:

- Run the K-means algorithm for different values of K; let’s say K = 1 to K = 10.
- For every K, calculate the total sum of distances of samples to their closest cluster centroid. Let’s call it SSE (Sum Squared Error). It is an error because, ideally, every sample should lie on the corresponding cluster’s centroid, which is impossible.
- Plot the curve of SSE with respect to the number of clusters K. One can observe the knee-shaped structure formed.
- The value of
**K**at the bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.

```
SSE = {}
for k in range(1, 10):
k_means = KMeans(n_clusters=k, max_iter=1000).fit(data)
data["clusters"] = kmeans.labels_
SSE[k] = kmeans.inertia_
# Inertia: Sum of distances of samples to their closest cluster #center
plt.figure()
plt.plot(list(SSE.keys()), list(SSE.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()
```

The optimal number of clusters for the data used in the above code is 3. Let’s learn one another famous method:

The Silhouette Coefficient (or Silhouette Score) is a metric used to evaluate the performance of a clustering algorithm. It is defined for every sample present in the dataset based on two calculations:

\

- a: Mean distance of a sample "x" from all other samples inside the same cluster.\
- b: Mean distance of the same sample "x" from all samples inside a cluster nearest to the cluster in which "x" is present.

The value of the Silhouette Coefficient is bounded in the range of [-1, 1].

1: This means clusters are dense and well apart from each other. We can clearly distinguish every cluster.

0: The distance between clusters is minimal, or we can say that clusters overlap.

-1: This is an indication of the wrong clustering.

**The pseudocode of this algorithm:**

- Run the K-means algorithm for different values of K; let’s say K = 2 to K = 11.
- For every K, calculate the average silhouette score of observations. Let’s call it AvS.
- Plot the curve of AvS with respect to the number of clusters K.
- The value of
**K**at the maximum AvS will be considered optimal.

```
for n_cluster in range(2, 11):
kmeans = KMeans(n_clusters=n_cluster).fit(data)
label = kmeans.labels_
AvS = silhouette_score(data, label, metric='euclidean')
print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, AvS))
##############################################################
'''
For n_clusters=2, The Silhouette Coefficient is 0.680813620271
For n_clusters=3, The Silhouette Coefficient is 0.552591944521
For n_clusters=4, The Silhouette Coefficient is 0.496992849949
For n_clusters=5, The Silhouette Coefficient is 0.488517550854
For n_clusters=6, The Silhouette Coefficient is 0.370380309351
For n_clusters=7, The Silhouette Coefficient is 0.356303270516
For n_clusters=8, The Silhouette Coefficient is 0.365164535737
For n_clusters=9, The Silhouette Coefficient is 0.346583642095
For n_clusters=10, The Silhouette Coefficient is 0.328266088778
'''
###############################################################
```

Here, n = 2 has the highest coefficient, so ideally, it should be selected. However, based on the knowledge of the Iris data, we were aware that there are three species, so the next value of 3 was chosen as optimal. When dealing with higher dimensions, this method produces better results.

This example shows that these methods can only help us decide the appropriate number of clusters, and we cannot rely entirely on their suggestions.

Let’s try to implement this algorithm using the Scikit-Learn library on one of the famous datasets of the framework, i.e., the Iris Dataset shown in the image below.

**Step 1:** Importing the required libraries of pandas, Iris datasets, KMeans model, and matplotlib.

```
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
```

**Step 2:** Extract the data from the load_iris dataset and separate the target variables.

```
iris = load_iris()
X = iris.data
y = iris.target
```

**Step 3:** Fit the Scikit-learn K-means model.

```
kmeans = KMeans(n_clusters=3, max_iter=1000).fit(X)
labels = kmeans.labels_
```

**Step 4:** Plot the scattered pred.

```
plt.figure(figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax.scatter(X[:, 3], X[:, 0], X[:, 2],
c=labels.astype(float), edgecolor='k')
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.dist = 12
plt.show()
```

We can see that the 3 clusters are formed based on the Petal width, Petal Length, and Sepal Length. From the Average silhouette method, we received the best number of clusters 2, and we also noticed from the above image that the yellow and purple samples are very close. That’s why the silhouette method needed clarification and suggested two clusters.

There is no doubt that clustering is one of the most important processes for data science and machine learning, but it has some practical challenges for which the solution is challenging. Some of these practical issues are:

- Performing normalization or standardization before applying to cluster sometimes produces better results and sometimes worse. So this decision needs to be firmer and can only be observed by actually implementing it. The final clusters are highly dependent on these steps.
- The choice of the number of clusters is not fixed and can only be observed by actually implementing it.
- The results of clustering algorithms are hard to analyze in the case of higher dimensions. Dimensionality reduction methods like PCA can reduce the dimensions but also cause data loss.

K-means is a popular algorithm widely used to analyze real-life data, including medical, sales, finance, and many more. Some of the most excellent applications of K-means are:

- This algorithm is used to construct the Yolo-v3,4,5 object detection models, which are famous in deep learning.
- Customer Segmentation is one of the most incredible examples which uses K-means directly. More details can be found here.
- Pattern recognition in images: We can feed millions and billions of pictures into this algorithm. K-means can segregate them into the desired number of clusters based on their content.

K-means is considered one of the fundamental algorithms that every interviewer expects a candidate knows about it. Data analysis and finding patterns in massive data is the most crucial step. Possible interview questions on K-means would be,

- What is the step-wise process in which K-means find clusters?
- How do we decide the number of clusters?
- How can we improve the speed of the native K-means algorithm?
- What are the inherent problems of clustering algorithms?
- What are the other methods used for clustering?

In this blog, we have discussed the K-means clustering algorithm in detail. We have learned the pseudocode of the K-means algorithm, its inherent problems, and methods to improve the time performance of the algorithm. K-means requires a pre-decided value for K which is tough to decide as data is not labeled. We also discussed methods to help find the optimal value of K via algorithmic approaches.

If you have any queries/doubts/feedback, please write us at contact@enjoyalgorithms.com. Enjoy learning, Enjoy algorithms!