Customer Segmentation using Hierarchical Clustering

What is Customer Segmentation?

Customer Segmentation helps businesses to develop customer-focused marketing strategies and tactics. Every customer has a different taste; therefore, one single approach is not applicable to all customers. Knowing customer behavior can help develop customer-targeted strategies and ultimately maximize sales. With the current advancements in Machine Learning and Artificial Intelligence, customer segmentation is not a challenging task anymore.

Customer Segmentation is splitting the organization's customer base into smaller groups that reflect some similarities in their behavior. The goal of segmenting customers is to make segment-wise decisions and maximize the value of each customer to the business.

What is Clustering?

Clustering is an unsupervised machine learning task that deals with dividing the data points into a number of groups based on feature similarities and finally allocating them into clusters. Ideally, the clusters should be distinct and non-overlapping. 

Clustering is prevalent in many fields, and hence, there exist several clustering algorithms, but for this session, we will keep ourselves limited to Hierarchical Clustering.  

Let's understand Customer Segmentation with an example

Suppose Walmart has collected the customer data based on their past transactions like the gender of the customer, age, annual income, spending score, shopped item category, etc. With the availability of all these parameters, Walmart's marketing team has sufficient information to explain customers' spending habits. 

Now imagine, Walmart is launching a campaign targeting customers interested in luxurious items. In order to attract them to the store, they have special offers, but extending the offer to all the customers will not make sense since not all customers are interested in luxurious items. We can segregate similar users who can buy luxury items using customer data to address this problem. The problem can be solved using clustering algorithms, where we can cluster the customers into different segments and select the cluster of potential customers for the campaign.

In this blog, we want to explore the potential of clustering algorithms to accomplish the above task. Let's start the clustering analysis:

Clustering Analysis

Our goal is to divide customers into different groups such that customers falling into the same group will be sharing similarities in terms of their spending habits. We will be using Kaggle's Mall Customer Segmentation dataset for the implementation. It's a small dataset having 200 customers' data and spending scores. 

import pandas as pd
customer_data = pd.read_csv('Mall_Customers.csv')

Customer Data

Data snippet of kaggle

We have four columns of great concern and one Customer ID column. Each row in this dataset represents a customer. We have the age, gender, customer's annual income, and spending score for each customer. 

Hierarchical Clustering

Hierarchical Clustering is an unsupervised machine learning algorithm that groups similar objects into groups called clusters. The outcome of this algorithm is a set of clusters where data points of the same cluster share similarities. Furthermore, the clustering can be interpreted using a dendrogram. 

Dendrogram formation

Hierarchical Clustering has two variants:

  • Agglomerative hierarchical clustering
  • Divisive Hierarchical clustering

Agglomerative vs Divisive

Agglomerative hierarchical clustering

This is the commonly used approach where we start with individual clusters for each data point and merge the clusters in a ranked manner based on their closeness. The final cluster contains all the data points. 

Divisive Hierarchical clustering

Divisive is quite the opposite of the Agglomerative approach. Here, we start with a single cluster containing all the data points. At each step, we split the most distant data point in the cluster, and this process is repeated till we arrive at the individual data points.  

Let's stick with the agglomerative approach since it is commonly used across industries. Now, you must be wondering: what are the criteria for calculating the nearness or closeness of two clusters? This is where the Proximity Matrix comes into the picture. Let's understand the concept of Proximity Matrix using a small example:

Imagine we have monthly salaries of 5 working professionals, and we want to group the similar professionals based on their monthly salary. Here, we will use hierarchical clustering to group similar professionals together.  

Salaries of customers

Firstly, we need to create a proximity matrix which is nothing but a square matrix of n x n dimensions containing the Euclidean Distances. If you are not aware of Euclidean distance, here's a quick link. The Euclidean distance formula is defined as follows:

Eucledian Distance

Let's take a look at the proximity matrix:

Proximity matrix 1

Proximity Matrix is also known as the Euclidean Distance Matrix

The distance between self is zero, and hence, the diagonals of the proximity matrix are always zero. We will merge the professionals with the lowest distance in the proximity matrix for pairing similar professionals. The minimum distance is 500 as per the above proximity matrix, not to be confused with zero, the distance with self. So, professionals 1 & 5 can be paired into one cluster, and professionals 4 & 3 can be paired into another cluster since the Euclidean distance is Lowest. 

Clusters formed: (1, 5) and (3, 4)

We will repeat the above process in the next iteration by constructing another proximity matrix. Before that, let's update the salary table.

Clsuter formation

Note: After making a cluster, we need to specify the criteria for assigning a salary to the so formed cluster. Here, we are using the maximum function. Some other options are minimum and average. 

Let's build the proximity matrix:

Proximity matrix 2

Here, we can see (3, 4) cluster shares similarities with professional two since their distance is lowest. Time to group them to update the clusters. 

Clusters Formed: (1, 5) and (2, 3, 4)

Now, we are left with two clusters only, and this time, whatever distance value comes up in the proximity matrix, these clusters are bound to merge. The final cluster will contain all the professionals. 

Let's finally conclude the clustering formed in this case using a Dendrogram.


A dendrogram is a visual representation of the hierarchical relationship between clusters. The X-axis contains the samples, also known as the leaf nodes, and Y-axis includes the distance between the two clusters or samples. The longer the distance between two samples, the less similar they are, and the converse is true. Following is a rough dendrogram to the above example:


To choose the number of clusters, we locate the most extensive vertical line, and we usually draw a horizontal line cutting the most extensive vertical line. The y-axis intersection value is the threshold for finalizing the optimal number of clusters. This way is inefficient and depends on the distance between two clusters. 

Dendrogram cutting

Number of optimal clusters for data = 2

Implementation of Hierarchical Clustering

Let's apply Hierarchical Clustering to our Customer Segmentation problem. Just before that, we only need to keep the relevant columns. 

features = customer_data.iloc[:, 1:4]

Data snippet 2


import scipy.cluster.hierarchy as sch

dendrogram = sch.dendrogram(sch.linkage(features, method = 'ward'))
plt.title('Dendrogam', fontsize = 20)
plt.ylabel('Ecuclidean Distance')

Dendrogram visualization on data 1

Three seems a suitable threshold as per the rule for cutting the most extensive vertical line for finalizing the optimal number of clusters. 

dend = shc.dendrogram(shc.linkage(features, method='ward'))
plt.axhline(y=300, color='r', linestyle='--')

Dendrogram visualization on data 2

The number of optimal clusters is 3. Let's visualize the clusters!

from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')  

plt.figure(figsize=(10, 7))  
plt.scatter(features['Annual Income (k$)'], features['Spending Score (1-100)'], c=cluster.labels_)
plt.xlabel("Annual Income", fontsize = 15)
plt.ylabel("Spending Score", fontsize = 15)

Clusters = 3 formation

Clusters at n= 3

The number of optimal clusters formed according to the rule is misleading in this analysis. Better rely on intuition and domain knowledge for deciding the optimal number of clusters. In the above plot, we can confirm the presence of 4 clusters. Let's set the optimal clusters as five and visualize the position of clusters.  

Clusters = 4 formation

Clusters at n = 4

This is an excellent improvement in the visualization of clusters. Sometimes the rule returns misleading information, but we should always confirm the results based on our intuition and domain knowledge. Back to the problem statement, so after analyzing the above scatter plot, we can say that we are interested in the red-colored cluster because these customers have a high annual income along with a high spending score, and that's precisely what we are looking for. Now, we can hand over these customers to the marketing team. They will approach these customers with some planned offers, and since they belong to a common cluster, we can confidently say that most of the customers will respond in a similar way to the offers. This ends the marketing campaign for us. 

Advantages of Hierarchical Clustering

  • Easy to implement and gives good results in some cases.
  • Unlike K-means, no need to define K clusters in the beginning. 

Disadvantages of Hierarchical Clustering

  • The execution time is pretty high. 
  • Hard to determine the accurate number of clusters using the dendrogram. 

Using hierarchical clustering over a Million Row Dataset is out of the question due to its high time complexity. Hence, it is not preferred for large to medium-range datasets. 


We started with a brief introduction to the customer segmentation problem statement and answered: why it is necessary to analyze the customer's spending habits. Further, we briefly discussed the working of Hierarchical clustering using a small example. We also spent some time understanding the proximity matrix and dendrograms. Finally, we implemented the Hierarchical clustering to our Customer Segmentation datasets and visualized the results using the Dendrogram and Scatterplot. We concluded the implementation followed by the pros and cons of the Hierarchical Clustering algorithm.

Enjoy learning, Enjoy algorithms!

Share feedback with us

More blogs to explore

Our weekly newsletter

Subscribe to get weekly content on data structure and algorithms, machine learning, system design and oops.

© 2022 Code Algorithms Pvt. Ltd.

All rights reserved.