Understanding data is crucial in the field of Machine Learning. With the recent trend of data collection, almost every organization now gathers some form of it. The concern, however, is that most of this data sits unused and is treated as digital junk. Labeling such vast amounts of data with the help of annotation engineers or data engineers would be time-consuming and highly expensive: the higher the desired data quality, the higher the labeling cost. But we know that budgets are always limited.
If we want to spend less on the workforce required to label the data, why not build algorithms that can provide us with information without manual intervention? This is where unsupervised learning algorithms come in. They help us extract meaningful information from the junk and present it in a human-readable format without supervision. The difference between Supervised and Unsupervised algorithms is explained in this blog.
The K-means algorithm is a type of clustering algorithm; hence, before diving into its details, let's first understand the term "Clustering."
If we define this term formally: Clustering is an unsupervised learning technique used to find the subgroups (also known as Clusters) in the dataset.
Let's understand it in layman's terms. Suppose we have candies of three flavors, Salty, Sweet, and Chocolate, mixed together in a box, and every candy has the same shape and size. Now suppose a kid asks us for a chocolate candy. Because the candies are mixed, this task will take time. To make it easier, we can separate the candies into three boxes, each corresponding to one flavor. This way of separating candies is known as clustering.
In the image below, data samples have been divided into three clusters, and each cluster has been assigned different colors. Yellow squares correspond to the 1st cluster, blue corresponds to the 2nd, and red corresponds to the 3rd.
The clustering technique is prevalent in many fields, and many algorithms exist to perform it. K-means is one of them.
James MacQueen first used the term "K-means" in 1967. It is an older technique but still very popular in the data science and machine learning industries. If we define the term formally,
K-means is a simple and elegant approach used to partition data samples into a pre-defined number "K" of distinct, non-overlapping clusters.
The value of K in the K-means algorithm depends upon the user's choice. In the image above, the user has defined the value of K = 3. Every observation in the dataset follows two fundamental properties: each sample belongs to exactly one of the K clusters, and no sample belongs to more than one cluster (the clusters do not overlap).
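For reference, with the usual squared Euclidean distance, the objective that K-means tries to minimize can be written as

\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k denotes the set of samples assigned to the k-th cluster and \mu_k is its mean (centroid).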
Because of its simplicity, relative robustness (it works across a wide variety of datasets), and "good enough" results, it is still one of the first algorithms data analysts reach for when investigating a new dataset.
Suppose we have some data samples (X1, X2, …, Xn), and we want to divide them into "K" clusters. The K-means algorithm performs this work iteratively, and the complete implementation can be summarized in five steps.
Accept the data and the number of clusters to group it into as inputs. In our case, we have n data samples, and we want to divide them into K clusters.
Initialize the first K clusters, for example by randomly picking K samples from the dataset (another common option is to shuffle the data and take the first K samples). These K samples will be treated as temporary centroids (also called means).
Now we have (n - K) data samples left. Each of these samples is assigned to exactly one of the K clusters, namely the one whose centroid is nearest according to a distance measure (e.g., Euclidean distance). At this stage, every sample belongs to some cluster.
Calculate the new centroid for every cluster. This time the centroids are not chosen randomly; instead, the samples present in each cluster are averaged to compute its mean. After this step, we have K centroids that can differ from the K samples chosen earlier.
Repeat the algorithm from Step 3 over all n samples until there is no (or negligible) movement of samples between clusters.
In this way, K-means groups the data samples into the desired number of clusters. The steps are straightforward and intuitive, which is one of the main reasons for the algorithm's popularity.
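To make these steps concrete, here is a minimal NumPy sketch of the loop described above; the random selection of K samples as initial centroids, the Euclidean distance, and the fixed iteration cap are assumptions made for illustration, not a reference implementation.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k samples as the initial (temporary) centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every sample to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the samples assigned to it
        # (keep the old centroid if a cluster happens to receive no samples)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop once the centroids (and hence the assignments) stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Calling k_means(X, 3) on a NumPy feature matrix X would run Steps 1 to 5 and return each sample's cluster label along with the final centroids.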
K-means is a vital algorithm, but it has certain limitations as well: the number of clusters K must be specified in advance, the final clusters depend on the randomly chosen initial centroids, outliers can pull centroids away from the true cluster centers, and the repeated distance computations make it expensive on massive datasets.
Given the limitations discussed above, there is scope to improve K-means further. In every iteration, the algorithm computes the distance between each data sample and the centroids, which makes it computationally very expensive in the case of massive datasets. However, we can reuse the distances calculated in the previous iteration to make it cheaper. Can you think how?
In the first iteration, we calculate the distance of each data point to its nearest cluster and store it as previous_distance[i] for the ith sample. In the next iteration, the centroids of all the clusters will change because samples move between clusters. Here, we first calculate the distance of the same ith sample to its previous cluster (but with the updated centroid) and call it current_distance[i]. Now we compare it with previous_distance[i]. There are two scenarios: if current_distance[i] is less than or equal to previous_distance[i], the sample is at least as close to its current cluster as before, so we keep it there and skip computing its distances to the other centroids; otherwise, we compute its distance to all K centroids and assign it to the nearest one.
With this comparison at the start, we save the time required to compute the distances to the remaining K - 1 centroids for many samples, which makes the algorithm slightly cheaper in terms of computational cost.
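As a rough illustration of this idea, here is a minimal sketch of one reassignment pass; the function name, the stored previous_distance array, and the use of plain NumPy arrays are assumptions made for this example rather than part of any library API.

import numpy as np

def reassign_with_previous_distance(X, centroids, assignment, previous_distance):
    # Illustrative sketch: reuse last iteration's distances to skip work
    for i in range(len(X)):
        # Distance of the ith sample to the *updated* centroid of its current cluster
        current_distance = np.linalg.norm(X[i] - centroids[assignment[i]])
        if current_distance <= previous_distance[i]:
            # Scenario 1: the sample is at least as close as before,
            # so keep it in its cluster and skip the other K-1 distances
            previous_distance[i] = current_distance
        else:
            # Scenario 2: compute the distances to all K centroids and reassign
            dists = np.linalg.norm(centroids - X[i], axis=1)
            assignment[i] = int(dists.argmin())
            previous_distance[i] = dists.min()
    return assignment, previous_distance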
The choice of an optimal number of clusters is subjective and depends on the similarity measure used to perform the clustering. However, some algorithmic methods can also help us decide on K.
One of them is the Elbow method: we run K-means for a range of K values, record the sum of squared errors (SSE, also called inertia) for each, and pick the K at which the curve bends like an elbow. Let's discuss the pseudo-code with which we can implement the Elbow method.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# data: a pandas DataFrame (or array) holding only the feature columns
SSE = {}
for k in range(1, 10):
    # Fit K-means for each candidate value of K
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    # kmeans.labels_ gives the cluster assigned to each sample
    # Inertia: sum of squared distances of samples to their closest cluster center
    SSE[k] = kmeans.inertia_

plt.figure()
plt.plot(list(SSE.keys()), list(SSE.values()))
plt.xlabel("Number of clusters (K)")
plt.ylabel("SSE")
plt.show()
The optimal value of K is 3 for the data used in the above code. There is another famous method, the Average Silhouette method: the silhouette coefficient measures how similar a sample is to its own cluster compared with the other clusters, and a value closer to 1 indicates well-separated clusters.
Let's learn the pseudo-code of this algorithm as well.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(data)
    label = kmeans.labels_
    # Average silhouette coefficient over all samples
    AvS = silhouette_score(data, label, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, AvS))
##############################################################
'''
For n_clusters=2, The Silhouette Coefficient is 0.680813620271
For n_clusters=3, The Silhouette Coefficient is 0.552591944521
For n_clusters=4, The Silhouette Coefficient is 0.496992849949
For n_clusters=5, The Silhouette Coefficient is 0.488517550854
For n_clusters=6, The Silhouette Coefficient is 0.370380309351
For n_clusters=7, The Silhouette Coefficient is 0.356303270516
For n_clusters=8, The Silhouette Coefficient is 0.365164535737
For n_clusters=9, The Silhouette Coefficient is 0.346583642095
For n_clusters=10, The Silhouette Coefficient is 0.328266088778
'''
###############################################################
Here, n = 2 has the highest coefficient, so ideally it should be selected. However, based on prior knowledge of the Iris data, we know there are three species, so the next-best value of K = 3 is chosen as optimal. The silhouette method tends to work better when dealing with higher-dimensional data. As this example shows, these methods can only help us decide on an appropriate number of clusters; we cannot rely entirely on their suggestions.
Let's try to implement this algorithm using the Scikit-learn library on one of its famous built-in datasets, the Iris dataset, shown in the image below.
Step 1: Import the required libraries: pandas, the Iris dataset loader, the KMeans model, and matplotlib.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
Step 2: Load the Iris data and separate the features from the target variable.
iris = load_iris()
X = iris.data
y = iris.target
Step 3: Fit the Scikit-learn K-means model
kmeans = KMeans(n_clusters=3, max_iter=1000).fit(X)
labels = kmeans.labels_
Step 4: Plot the predicted clusters in a 3D scatter plot.
fig = plt.figure(figsize=(4, 3))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)
# Plot petal width vs. sepal length vs. petal length, colored by cluster label
ax.scatter(X[:, 3], X[:, 0], X[:, 2],
           c=labels.astype(float), edgecolor='k')
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
We can see that the 3 clusters are formed based on the petal width, petal length, and sepal length. The Average Silhouette method suggested 2 as the best number of clusters, and from the image above we can also see that the yellow and purple samples lie very close to each other. That closeness is why the silhouette method got confused and suggested two clusters.
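To quantify that overlap, one optional check is to cross-tabulate the predicted cluster labels against the true species; note that the cluster numbers assigned by K-means are arbitrary and need not match the species codes.

# Compare cluster assignments with the true species labels (cluster ids are arbitrary)
print(pd.crosstab(labels, y, rownames=['cluster'], colnames=['species']))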
There is no doubt that clustering is one of the most important processes in data science and machine learning, but it comes with practical challenges that are hard to resolve, such as deciding the number of clusters, choosing a suitable similarity or distance measure, deciding whether to standardize the features, and handling outliers that can distort the clusters.
K-means is a popular algorithm widely used to analyze real-life data in domains including medicine, sales, finance, and many more. Some well-known applications of K-means are customer segmentation, document clustering, image compression (color quantization), and anomaly detection.
K-means is considered one of the fundamental algorithms that every interviewer expects a candidate to know. Analyzing data and finding patterns in massive datasets is a crucial step, and knowledge of K-means suggests that a data scientist candidate has worked with unlabeled data. Possible interview questions on this topic would be: What is clustering, and how does the K-means algorithm work? How do you choose the value of K? What are the limitations of K-means, and how can its time performance be improved?
In this article, we discussed the K-means algorithm in detail. We learned the pseudo-code of K-means, the inherent problems associated with it, and how we can improve its time performance. We also discussed methods that help find the optimal value of K, one of the most frequent questions in machine learning interviews. Finally, we implemented the algorithm on the Iris dataset.