Customer Segmentation helps businesses develop customer-focused marketing strategies and tactics. Every customer has different tastes, so a single approach rarely works for all of them. Knowing customer behavior helps create targeted strategies and ultimately maximize sales. With the current advancements in Machine Learning and Artificial Intelligence, customer segmentation is no longer a challenging task.

Customer segmentation splits the organization's customer base into smaller groups that reflect similar behavior. The goal of segmenting customers is to make segment-wise decisions and maximize the value of each customer to the business.

Clustering is an unsupervised machine learning task that divides the data points into groups based on feature similarities and allocates them into clusters. Ideally, the clusters should be distinct and non-overlapping.

Clustering is prevalent in many fields, and hence several clustering algorithms exist, but for this session, we will limit ourselves to Hierarchical Clustering.

Suppose Walmart has collected customer data based on past transactions like the gender of the customer, age, annual income, spending score, shopped item category, etc. With the availability of all these parameters, Walmart's marketing team has sufficient information to explain customers' spending habits.

Now imagine Walmart is launching a campaign targeting customers interested in luxurious items. They have special offers to attract them to the store, but extending them to all the customers will not make sense since not all customers are interested in luxurious items. We can segregate similar users who can buy luxury items using customer data to address this problem. The problem can be solved using clustering algorithms, where we can cluster the customers into different segments and select the cluster of potential customers for the campaign.

In this blog, we want to explore the potential of clustering algorithms to accomplish the above task. Let's start the clustering analysis:

Our goal is to divide customers into groups such that customers falling into the same group share similarities in terms of their spending habits. We will be using Kaggle's Mall Customer Segmentation dataset for the implementation. It's a small dataset containing 200 customers' demographics and spending scores.

```
import pandas as pd
customer_data = pd.read_csv('Mall_Customers.csv')
customer_data.head()
#####################################################
'''
CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100)
0        1 | Male   | 19  | 15                 | 39
1        2 | Male   | 21  | 15                 | 81
2        3 | Female | 20  | 16                 | 6
3        4 | Female | 23  | 16                 | 77
4        5 | Female | 31  | 17                 | 40
'''
#######################################################
```

We have four columns of great concern and one Customer ID column. Each row in this dataset represents a customer. We have the age, gender, customer's annual income, and spending score for each customer.

Hierarchical Clustering is an unsupervised machine-learning algorithm that groups similar objects into clusters. The outcome of this algorithm is a set of clusters in which data points of the same cluster share similarities. Furthermore, the clustering can be interpreted using a dendrogram.

**Hierarchical Clustering has two variants:**

- Agglomerative hierarchical clustering
- Divisive Hierarchical clustering

Agglomerative is the commonly used, bottom-up approach: we start with an individual cluster for each data point and repeatedly merge the two closest clusters. The final cluster contains all the data points.

Divisive is quite the opposite of the Agglomerative approach. Here, we start with a single cluster containing all the data points. At each step, we split off the most dissimilar data points, and this process is repeated until we arrive at individual data points.

Let's stick with the agglomerative approach since it is commonly used across industries. What are the criteria for calculating the nearness or closeness of two clusters? This is where the **Proximity Matrix** comes into the picture. Let's understand the concept of the Proximity Matrix using a small example:

Imagine we have the monthly salaries of 5 working professionals, and we want to group similar professionals based on their monthly salaries. Here, we will use hierarchical Clustering to group similar professionals together.

```
salary id | salary
1 | 2000$
2 | 5500$
3 | 4000$
4 | 4500$
5 | 2500$
```

Firstly, we need to create a proximity matrix, which is a square matrix of n x n dimensions containing the pairwise Euclidean distances. For two points p and q, the Euclidean distance is defined as d(p, q) = √((p₁ − q₁)² + ⋯ + (pₙ − qₙ)²). For one-dimensional data like our salaries, this reduces to the absolute difference |p − q|.

Let's take a look at the proximity matrix (also known as the Euclidean distance matrix):

```
      1     2     3     4     5
1     0  3500  2000  2500   500
2  3500     0  1500  1000  3000
3  2000  1500     0   500  1500
4  2500  1000   500     0  2000
5   500  3000  1500  2000     0
```
The distance from a point to itself is zero; hence, the proximity matrix's diagonal is always zero. To pair similar professionals, we merge the entries with the lowest distance in the proximity matrix. Per the above matrix, the minimum off-diagonal distance is 500 (not to be confused with the zero distances to self). So, professionals 1 & 5 can be paired into one cluster, and professionals 3 & 4 into another, since their Euclidean distances are lowest.

Clusters formed: (1, 5) and (3, 4)
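The proximity matrix and the closest pair can be computed in a few lines; a minimal sketch using NumPy, with the salary values taken from the table above:

```python
import numpy as np

# salaries of the five professionals (ids 1..5)
salaries = np.array([2000, 5500, 4000, 4500, 2500])

# pairwise Euclidean distances; for 1-D data this is just |x_i - x_j|
proximity = np.abs(salaries[:, None] - salaries[None, :])
print(proximity)

# mask the zero diagonal, then locate the closest pair
masked = np.where(proximity == 0, np.inf, proximity)
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print(i + 1, j + 1, masked[i, j])  # ids of the closest professionals and their distance
```

Running this reports professionals 1 and 5 at distance 500, matching the matrix above (professionals 3 and 4 tie at the same distance).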

We will repeat the above process in the next iteration by constructing another proximity matrix. Before that, let's update the salary table.

```
Clusters | salary
(1,5) | 2500$
2 | 5500$
(3,4) | 4500$
```

**Note:** After making a cluster, we need to specify the criteria for assigning a salary to the so-formed cluster. Here, we are using the maximum function. Some other options are minimum and average.
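These merge criteria correspond roughly to the `method` argument of scipy's `linkage` function: `'single'` (minimum), `'complete'` (maximum), and `'average'`. A minimal sketch on the salary data, using complete linkage to mirror the maximum criterion of the worked example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# salaries of the five professionals, shaped as a 5 x 1 observation matrix
salaries = np.array([[2000], [5500], [4000], [4500], [2500]])

# 'complete' linkage measures cluster distance by the maximum pairwise distance
Z = linkage(salaries, method='complete')
print(Z)  # each row: [cluster_a, cluster_b, merge_distance, new_cluster_size]
```

The first two rows merge (1, 5) and (3, 4) at distance 500, the third row joins professional 2 to (3, 4), and the last row merges everything, just as in the worked example.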

Let's build the proximity matrix for the updated table:

```
        (1,5)     2  (3,4)
(1,5)       0  3000   2000
2        3000     0   1000
(3,4)    2000  1000      0
```

Here, we can see that the (3, 4) cluster is closest to professional 2, since their distance (1000) is the lowest. Time to group them and update the clusters.

`Clusters Formed: (1, 5) and (2, 3, 4)`

Now, we are left with only two clusters; this time, whatever distance value comes up in the proximity matrix, these clusters are bound to merge. The final cluster will contain all the professionals.

Let's finally conclude the Clustering formed in this case using a Dendrogram.

A dendrogram is a visual representation of the hierarchical relationship between clusters. The X-axis holds the samples, also known as the leaf nodes, and the Y-axis shows the distance at which two clusters or samples merge. The greater the merge distance between two samples, the less similar they are, and vice versa. Following is a rough dendrogram of the above example:
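Such a dendrogram can be drawn with scipy; a minimal sketch for the salary example, using complete (maximum-distance) linkage as in the worked example and saving the plot to a hypothetical file name:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# salaries of the five professionals from the worked example
salaries = np.array([[2000], [5500], [4000], [4500], [2500]])

# complete linkage mirrors the "maximum" merge criterion used above
Z = linkage(salaries, method='complete')

d = dendrogram(Z, labels=['1', '2', '3', '4', '5'])  # leaves are professional ids
plt.title('Dendrogram of professionals by salary')
plt.ylabel('Distance')
plt.savefig('salary_dendrogram.png')
```

The leaves for professionals 1 and 5 join low on the tree, as do 3 and 4, while professional 2 attaches to (3, 4) at a greater height.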

To choose the number of clusters, we locate the longest vertical line that no horizontal line crosses and draw a horizontal line cutting through it. The number of vertical lines this horizontal line intersects gives the number of clusters, and its Y-axis value serves as the distance threshold. This heuristic is not foolproof, since it depends entirely on the distances between clusters.

Number of optimal clusters for data = 2
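Once a threshold is chosen, scipy's `fcluster` can cut the tree at that height; a minimal sketch on the salary example, where 2000 is an illustrative threshold sitting between the last two merge distances (1500 and 3500):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

salaries = np.array([[2000], [5500], [4000], [4500], [2500]])
Z = linkage(salaries, method='complete')

# cut the tree at distance 2000: only merges below that height are kept
labels = fcluster(Z, t=2000, criterion='distance')
print(labels)            # one cluster id per professional
print(len(set(labels)))  # number of clusters at this threshold
```

At this threshold, professionals 1 and 5 land in one cluster and 2, 3, and 4 in the other, giving the two clusters stated above.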

Let's apply Hierarchical Clustering to our Customer Segmentation problem. Before that, we need to keep only the relevant columns.

```
features = customer_data.iloc[:, 2:5]  # Age, Annual Income, Spending Score
features.head()
######################################
'''
Age | Annual Income (k$) | Spending Score (1-100)
0 19 | 15 | 39
1 21 | 15 | 81
2 20 | 16 | 6
3 23 | 16 | 77
4 31 | 17 | 40
'''
#####################################
```


```
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

dendrogram = sch.dendrogram(sch.linkage(features, method='ward'))
plt.title('Dendrogram', fontsize=20)
plt.xlabel('Customers')
plt.ylabel('Euclidean Distance')
plt.show()
```

As per the rule of cutting the longest uninterrupted vertical line, three seems a suitable number of clusters. Let's draw the threshold line on the dendrogram to confirm:

```
plt.title("Dendrograms")
dend = sch.dendrogram(sch.linkage(features, method='ward'))
plt.axhline(y=300, color='r', linestyle='--')  # threshold cutting three vertical lines
plt.show()
```

The number of optimal clusters is 3. Let's visualize the clusters!

```
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# ward linkage implies Euclidean distances, so no metric argument is needed
cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')
cluster.fit_predict(features)

plt.figure(figsize=(10, 7))
plt.scatter(features['Annual Income (k$)'], features['Spending Score (1-100)'], c=cluster.labels_)
plt.xlabel("Annual Income", fontsize=15)
plt.ylabel("Spending Score", fontsize=15)
plt.show()
```

**Clusters at n= 3**

The number of clusters suggested by the dendrogram rule turns out to be inaccurate for this analysis; it is better to also rely on intuition and domain knowledge when deciding the optimal number of clusters. In the above plot, we can clearly see five distinct groups. Let's set the number of clusters to five and visualize their positions.

**Clusters at n = 5**

This is a clear improvement in the cluster visualization. Sometimes the rule returns misleading information, so we should always confirm the results with intuition and domain knowledge. Back to the problem statement: after analyzing the above scatter plot, we are interested in the red-colored cluster, because these customers have a high annual income and a high spending score, which is precisely what we are looking for. Now we can hand these customers over to the marketing team, who will approach them with planned offers. Since they belong to a common cluster, we can be reasonably confident that most of them will respond similarly to the offers. This concludes the marketing campaign for us.
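Extracting the target segment amounts to joining the fitted labels back onto the customer table and profiling each cluster. A minimal sketch on a tiny made-up frame (the six rows are hypothetical stand-ins for the Mall dataset's income and spending columns):

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# hypothetical stand-in rows mimicking the Mall dataset's columns
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6],
    'Annual Income (k$)': [15, 16, 17, 85, 88, 90],
    'Spending Score (1-100)': [39, 6, 40, 82, 91, 88],
})

model = AgglomerativeClustering(n_clusters=2, linkage='ward')
customers['Cluster'] = model.fit_predict(
    customers[['Annual Income (k$)', 'Spending Score (1-100)']])

# profile each segment and pick the high-income, high-spending one;
# the cluster id is looked up from the data rather than hard-coded
profile = customers.groupby('Cluster')[
    ['Annual Income (k$)', 'Spending Score (1-100)']].mean()
target_id = (profile['Annual Income (k$)'] + profile['Spending Score (1-100)']).idxmax()
target = customers[customers['Cluster'] == target_id]
print(target['CustomerID'].tolist())  # customers to hand to the marketing team
```

On the real dataset, the same pattern (with `n_clusters=5`) isolates the high-income, high-spending segment for the campaign.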

**Pros:**

- Easy to implement and gives good results in some cases.
- Unlike K-means, there is no need to define the number of clusters at the beginning.

**Cons:**

- The execution time is quite high.
- It is hard to determine the accurate number of clusters from the dendrogram.

Due to its high time complexity, using Hierarchical Clustering on a million-row dataset is out of the question. Hence, it is not preferred for medium-to-large datasets.

We started with a brief introduction to the customer segmentation problem statement and answered why it is necessary to analyze customers' spending habits. Further, we discussed the workings of Hierarchical Clustering using a small example, spending some time on the proximity matrix and dendrograms. Finally, we applied Hierarchical Clustering to our Customer Segmentation dataset and visualized the results using a dendrogram and a scatter plot. We concluded the implementation with the pros and cons of the Hierarchical Clustering algorithm.

**Next Blog:** Introduction to Principal Component Analysis