Data understanding is very crucial in the field of Machine Learning. With the recent trend of collecting data, every organization has started collecting some form of it. But the concern is that most of these data are unutilized and considered digital junk. Labeling this vast amount of data with the help of annotation engineers or data engineers would be time-consuming and highly expensive. The quality of data is directly proportional to the cost it acquires. But we know that the budget is always limited.

If we can spend less on the workforce required to label the data, why not build algorithms that can provide us with information without manual intervention? In such a scenario, unsupervised learning algorithms come into existence. They help us extract meaningful information from the junk and present it in a smooth human-readable format without supervision. The absolute difference between Supervised and Unsupervised algorithms is explained in this blog.

- What is clustering in machine learning?
- What is meant by the K-means algorithm?
- What are the basic steps of K-means clustering?
- What are the limitations of K-means?
- What are the possible improvements in K-means?
- How to choose the value of K in K-means?
- What are some inherent problems with clustering algorithms?

K-means algorithm is a type of clustering algorithm; hence, before moving toward the depth of this algorithm, let's first understand the term "Clustering."

If we define this term formally: **Clustering is an unsupervised learning technique used to find the subgroups (also known as Clusters) in the dataset.**

Let's understand it in a layman's form. Suppose we have three flavored candies in a box, Salty, Sweet, and Chocolate. Earlier, all these candies were mixed in the box, and every candy had the same shape and size. Now suppose a kid asks us to give a chocolate candy. As the candies are mixed, this task will take time. We can separate these candies into three boxes to make this easier, each corresponding to one taste. This way of separating candies is known as clustering.

In the image below, data samples have been divided into three clusters, and each cluster has been assigned different colors. Yellow squares correspond to the 1st cluster, blue corresponds to the 2nd, and red corresponds to the 3rd.

The clustering technique is prevalent in many fields, and many algorithms exist to perform it. K-means is one of them.

James MacQueen first used the term "K-means" in 1967. It is an older technique but still very popular in the data science and machine learning industries. If we define the term formally,

K-means is a simple and elegant approach which is used to partition data samples into a pre-defined “

K“distinct and non-overlappingclusters.

The value of K in the K-means algorithm depends upon the user's choice. In the image above, the user has defined the value of K = 3. Every observation from the dataset will follow two fundamental properties,

**Each observation belongs to at least one of the K clusters.**In simple terms, there can't be any sample that is not a part of any of the clusters.**No observation will belong to more than one cluster.**In simple terms, one sample can not lie in two different clusters simultaneously.

Because of its **simplicity**, **relative robustness** (work with a wider variety of datasets), and "**good enough**" explanation, it is still considered one of the first algorithms data analysts use to investigate any new dataset.

Suppose we have some data samples (X1, X2, …, Xn), and we want to divide these data samples into "K" clusters. We can use the K-means algorithm to perform this work based on iterative logic. The complete implementation can be summarized in five steps.

Accept inputs of data and the number of clusters to group that data. Our case has n data samples, and we want to divide these samples into **k** clusters.

Initialize the first K clusters. There can be two ways to do that:

- Pick first K samples, or
- Take Random sampling of K elements from the available dataset.

These **K** samples will be treated as temporary centroids (also called mean).

Now we have **(n-k)** data samples left. Based on the proximity/distance to the K centroid values, each data sample from (n-k) will be assigned to only one of the K clusters. At this stage, all the samples have been assigned to some cluster. Each record is assigned to the nearest cluster using a measure of distance (e.g., Euclidean distance).

We need to calculate the new centroids for every cluster. But this time, the centroid calculation will not be random, but the samples present in the cluster will be averaged to calculate the mean. At this step, we will have **K** centroids that can be different from the earlier chosen random **K** samples**.**

We will iterate over this algorithm starting from Step 3 for the entire **n** samples. This will be repeated until we see no/negligible movement of samples among clusters.

In this way, K-means groups the data samples into the desired number of clusters. The steps are very straightforward and very intuitive. This is one of the most important reasons for the popularity of this algorithm.

K-means is a vital algorithm, but it has certain limitations as well. Some of these limitations are:

- Looking carefully at the steps mentioned above, this algorithm is computationally expensive. It requires time proportional to the product of the
**number of data items (n)**, the**number of clusters (k)**, and the**number of iterations (m).** - We randomly picked K clusters in the first step because K-means generally converge to a local optimum. The quality of the resulting clusters highly depends upon the selection of initial
**K**clusters. - There can be situations where this algorithm can suffer from an empty cluster problem. If we remember, we randomly picked K initial values for K centroids, and we may select an outlier here. In such conditions, the cluster might become empty.

As per the limitations discussed above, there are scopes where K-means can be improved further. The K-means algorithm computes the distance between data samples and the centroid within each cluster in each iteration. It makes this algorithm computationally very expansive in the case of massive datasets. However, we can utilize the knowledge of the distance calculated in the previous iteration to make it computationally cheaper. **Can you think how?**

In the first iteration, we calculated the distance for each data point to the nearest cluster and termed it as **previous_distance[i]** for the **ith** sample. At the next iteration, the centroids of all the clusters will change because samples will change their clusters. Here, we will first calculate the distance of the same **ith** sample to the previous cluster (**but with the updated centroid**) and termed it as **current_distance[i]**. Now we will compare it with the previous distance **previous_distance[i]**. There will be two scenarios:

**current**There will be no movement of the ith sample within clusters. We can interpret it with the situation that the updated centroid of the cluster came closer to the ith sample, and hence it became a more potential candidate for falling into that cluster.*distance[i] ≤ previous*distance[i]:**current**Sample will leave the previous cluster and move towards the other cluster.*distance[i] >previous*distance[i]:

With this comparison at the start, we will be saving the time required to compute the distances to **K-1** clusters, and it becomes slightly better in terms of computational complexity.

The decision of an optimal number of clusters is subjective and depends upon the methods to find the similarity based on which we performed the clustering. There are some algorithmic methods also that can help us in making decisions on K.

Let's discuss the pseudo-code using which we can implement the Elbow method.

- Run the K-means algorithm for different values of K; let's say K=1 to K=10.
- For every K, calculate the total sum of distances of samples to their closest cluster centroid. Let's say it SSE (Sum Squared Error). We say it is an error because, ideally, every sample should lie on the corresponding cluster's centroid, which is impossible.
- Plot the curve of SSE with respect to the number of clusters K.
- The value of
**K**at the bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.

```
SSE = {}
for k in range(1, 10):
k_means = KMeans(n_clusters=k, max_iter=1000).fit(data)
data["clusters"] = kmeans.labels_
SSE[k] = kmeans.inertia_
# Inertia: Sum of distances of samples to their closest cluster #center
plt.figure()
plt.plot(list(SSE.keys()), list(SSE.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()
```

The optimal value of clusters is 3 for the data used in the above code. There is another famous method,

Let's learn the pseudo-code of this algorithm as well.

- Run the K-means algorithm for different values of K, let's say K=2 to K=11.
- For every K, calculate the average silhouette of observations. Let's say AvS.
- Plot the curve of AvS with respect to the number of clusters K.
- The value of
**K**at the maximum AvS will be considered optimal.

```
for n_cluster in range(2, 11):
kmeans = KMeans(n_clusters=n_cluster).fit(data)
label = kmeans.labels_
AvS = silhouette_score(data, label, metric='euclidean')
print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, AvS))
##############################################################
'''
For n_clusters=2, The Silhouette Coefficient is 0.680813620271
For n_clusters=3, The Silhouette Coefficient is 0.552591944521
For n_clusters=4, The Silhouette Coefficient is 0.496992849949
For n_clusters=5, The Silhouette Coefficient is 0.488517550854
For n_clusters=6, The Silhouette Coefficient is 0.370380309351
For n_clusters=7, The Silhouette Coefficient is 0.356303270516
For n_clusters=8, The Silhouette Coefficient is 0.365164535737
For n_clusters=9, The Silhouette Coefficient is 0.346583642095
For n_clusters=10, The Silhouette Coefficient is 0.328266088778
'''
###############################################################
```

Here, n = 2 has the highest coefficient, so ideally, it should be selected, but, based on the knowledge of the Iris data, we were aware that three species are there, so the next value of 3 is chosen for optimal. When dealing with higher dimensions, this method produces a better result. From this example, these methods can only help us decide the appropriate number of clusters, and we can not wholly rely on their predictions.

Let's try to implement this algorithm using the Scikit-Learn library on one of the famous datasets of the framework, i.e., the Iris Dataset shown in the image below.

**Step 1:** Importing the required libraries of pandas, Iris datasets, KMeans model, and matplotlib.

```
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
```

**Step 2:** Extract the data from the load_iris dataset and separate the target variables.

```
iris = load_iris()
X = iris.data
y = iris.target
```

**Step 3:** Fit the Scikit-learn K-means model

```
kmeans = KMeans(n_clusters=3, max_iter=1000).fit(X)
labels = kmeans.labels_
```

**Step 4:** Plot the scattered pred

```
plt.figure(figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax.scatter(X[:, 3], X[:, 0], X[:, 2],
c=labels.astype(float), edgecolor='k')
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.dist = 12
plt.show()
```

We can see that the 3 clusters are formed based on the Petal width, Petal Length, and Sepal Length. From the Average silhouette method, we received the best number of clusters 2, and we also noticed from the above image that the yellow and purple samples are very close. That's why the silhouette method needed clarification and suggested two clusters.

There is no doubt that clustering is one of the most important processes for data science and machine learning, but it has some practical challenges for which the solution is challenging to find. Some of these practical issues are

- Performing normalization or standardization before applying cluster sometimes produce better results and sometimes worse. So this decision is not firm and only be observed by actually implementing it. And the final clusters are highly dependent upon these steps.
- The choice of the number of clusters is not fixed and only can be observed by actually implementing it.
- The results of the clustering algorithms are hard to analyze in the case of higher dimensions. Dimensionality reduction methods like PCA can reduce the dimensions here, but it also causes data loss.

K-means is a popular algorithm widely used to analyze real-life data, including medical, sales, finance, and many more. Some of the most excellent applications of K-means are:

- This algorithm is used to construct the Yolo-v3,4,5 object detection models, which are famous in deep learning.
- Customer Segmentation is one of the most incredible examples which uses K-means directly. More details can be found here.
- Pattern recognition in images: We can feed millions and billions of pictures into this algorithm. K-means can segregate them into the desired number of clusters based on their content.

K-means is considered one of the fundamental algorithms that every interviewer expects a candidate knows about it. Data analysis and finding patterns in massive data is the most crucial step. Knowledge of k-means suggests that the data scientist position candidates have worked on unstructured data. Possible interview questions on this topic would be,

- What is the step-wise process in which K-means find clusters?
- How do we decide the number of clusters?
- How can we improve the speed of the native K-means algorithm?
- What are the inherent problems of clustering algorithms?
- What are the other methods used for clustering?

In this article, we discussed the K-means algorithm in a detailed fashion. We learned the pseudo-code of K-means algorithms, inherent problems associated with the K-means algorithm, and how we can improve the time performance of the K-means algorithm. We discussed methods to help find the optimal value K, one of the most frequent questions in machine learning interviews. After that, we implemented the algorithm on the Iris dataset.