Sort Characters by Frequency

Difficulty: Medium, Asked-in: Google, Microsoft, Bloomberg, Paypal

Key takeaway: An excellent problem to learn problem-solving using sorting and data structures like BST and hash table. We are storing additional information to sort the string in a stable order.

Let's understand the problem!

Given a string S, write a program to sort string in decreasing order based on the frequency of characters.

The frequency of a character is the number of times it appears in the string.
In the output, characters with higher frequency come first.
If two characters have the same frequency, whichever occurs earliest in S, must come first. In other words, the sorting must be stable.
Assume string consists of uppercase and lowercase English letters. We need to consider the uppercase and lowercase of the same character as a different one.

Important note: Before moving on to the solutions, we recommend trying this problem on paper for atleast 15 or 30 minutes. Enjoy problem-solving!

Example 1

Input: S = "tree", Output: "eetr"

Explanation: 'e' appears twice while 'r' and 't' both appear once. So 'e' must appear before both 'r' and 't'.

Example 2

Input: S = "ccaaaAA", Output: "aaaccAA"

Explanation: 'a' appears three times while 'c' and 'A' appear twice. So 'a' must appear before both 'c' and 'A'. Similarly, the frequency of 'c' and 'A' are the same. To maintain the stable order 'c' must appear before 'A'.

Discussed solution approaches

Brute force approach using nested loops
Solution approach using a binary search tree
An efficient solution using hash table

Solution approach 1: Brute force approach using nested loops

Solution idea

To sort characters based on their frequency:

We should identify each unique character in the input string.
We also need to know the frequency of each unique character.
Sorting must be stable, so we need to design a method to distinguish the position of characters with the same frequency in sorted order. In such a situation: A character that occurs first in the input string must come first in the sorted output.

So how do we ensure the above things? One basic idea would be:

We create an extra memory and store frequency count and first occurrence of each unique input string character.
Now we sort this extra memory based on frequency count and first occurrence. We are storing the first occurrence for a simple reason: This will help us to maintain a stable sorting. In other words, if two characters have the same frequency, a character with lower first index will come first in sorted output.
We traverse the above sorted list and extract all characters based on their frequency count in an extra memory of size n.

Solution steps

We define a class or structure UniqueChar, which stores three information related to each unique character: character value, frequency count, and first occurrence in the input string.
```
struct UniqueChar
{
   char ch
   int freqCount
   int firstIndex
}
```
Now we declare an additional list freq[], where each element stores the object type UniqueChar i.e. vector freq. We are taking a list, not an array, because we don't know the number of unique characters in the input string. One idea would be clear: This list's size will equal the total number of unique characters.

Now we run a nested loop to identify each unique character and store their frequency count and first occurrence in the freq[] list. Here outer loop iterates through each character, and the inner loop checks its unique occurrence in the freq[] list.

For each character S[i], we check if it already exists in the freq[] list or not. If it does, we increase its frequency count by 1.
Otherwise, it will be the first occurrence of S[i]. So we insert S[i] in the freq[] list with the frequency count as 1 and index i as a first occurrence.
After the nested loops, freq[] stores all the unique characters with additional information.

for (int i = 0; i < n; i = i + 1)
{
    bool found = false
    for (int j = 0; j < freq.length; j = j + 1)
    {
        if (S[i] == freq[j].ch)
        {
            freq[j].freqCount = freq[j].freqCount + 1
            found = true
            break
        }
    }
        
    if (found == false)
    {
        UniqueChar temp
        temp.ch = S[i]
        temp.freqCount = 1
        temp.firstIndex = i
        freq.push_back(temp)
    }
}

Now we sort the freq[] list based on the frequency count of characters. If the frequency count is equal for two elements, we should first process the element with lower firstIndex. Our sorting procedure can use a comparator to ensure this.
```
sort(freq, freq.length, comp)
bool comp(UniqueChar x, UniqueChar y)
{
   if (x.freqCount != y.freqCount)
       return x.freqCount > y.freqCount
   else
       return x.firstIndex < y.firstIndex
}
```

After this, we extract characters from the sorted freq[] list and store it in an extra output array result[n].

char result[n]
int k = 0
for (int i = 0; i < freq.length; i = i + 1)
{
   for (int j = 0; j < freq[i].freqCount; j = j + 1)
   {
       result[k] = freq[i].ch
       k = k + 1
   }    
}

Finally, we return the result[] array as an output.

Solution code C++

// Struct to store unique characters and their frequency in the input string
struct UniqueChar 
{
    char ch;
    int freqCount;
    int firstIndex;
};

// Comparison function for sorting the unique characters by frequency
bool comp(UniqueChar x, UniqueChar y) 
{
    if (x.freqCount != y.freqCount)
        return x.freqCount > y.freqCount;
    else
        return x.firstIndex < y.firstIndex;
}

// Function to sort the input string by frequency of characters
vector<char> frequencySort(char S[], int n) 
{
    vector<UniqueChar> freq;
    for (int i = 0; i < n; i = i + 1) 
    {
        bool found = false;
        for (int j = 0; j < freq.size(); j = i + 1) 
        {
            if (S[i] == freq[j].ch) 
            {
                freq[j].freqCount = freq[j].freqCount + 1;
                found = true;
                break;
            }
        }
        
        if (!found) 
        {
            UniqueChar temp;
            temp.ch = S[i];
            temp.freqCount = 1;
            temp.firstIndex = i;
            freq.push_back(temp);
        }
    }
    
    sort(freq.begin(), freq.end(), comp);
    vector<char> result;
    for (int i = 0; i < freq.size(); i = i + 1) 
    {
        for (int j = 0; j < freq[i].freqCount; j = j + 1)
            result.push_back(freq[i].ch);
    }
    
    return result;
}

Solution analysis

Suppose there are m unique characters in the input string. Time complexity = Time complexity of storing values in freq[] list + Time complexity of sorting freq[] list + Time complexity of storing output in result[] = O(mn) + O(m) + O(n) = O(mn)

Why is the time complexity of sorting freq[] list O(m)? The idea would be simple: If only uppercase and lowercase English letters are part of the input, there can be only 52 possible unique characters repeated several times. So we can use the idea of bucket sort to sort the input in O(m) time! In other words, there is no need to use comparison-based sorting algorithms like merge sort or heap sort! Think!

Space complexity = Extra space of freq[] list + Extra space used by sorting + Storing result[] array = O(m) + O(1) + O(n) = O(m + n)

Solution approach 2: Using binary search tree

Solution idea

In the above approach, the nested loops dominate the time complexity, which is O(nm). If we observe the loop: We are searching for the unique presence of each character in the freq[] list and inserting it with additional information. So one idea is clear: Searching is a critical operation that takes O(m) time for each character in the worst case. The critical question is: Can we use some other efficient data structure to improve the time complexity of searching?

A binary search tree can be a useful data structure for frequent search queries, where the time complexity depends on the tree's height. Although, if we use the idea of a balanced binary search tree such as an AVL Tree or Red-Black tree, the worst-case time complexity for searching, insertion, and deletion would be O(logn).

We can use the following node structure to store each unique character in a BST with two additional parameters: freqCount to store the frequency count and firstIndex for string the first occurrence.

Note: We will be assuming a balanced BST for the implementation.

class BSTNode
{
    char ch
    int freqCount
    int firstIndex
    BSTNode *left
    BSTNode *right
}

Solution steps

We create a BST by inserting each unique character from the input string. Before inserting a character S[i] in BST, we check if it already exists or not. If it does, we increase the frequency count by 1. Else, we insert a new node with frequency count 1 and i as a first occurrence.
Now we create an extra object list freq[] (like the above solution) and perform an in-order traversal of BST to store unique characters with other details in freq[].
Now we follow the procedure of the previous approach:
We sort the freq[] list based on the frequency count of characters while maintaining the stable order.
We extract characters from the sorted freq[] list and store it in an extra output array result[n].
Finally, we return the result[] array as an output.

Solution code C++

struct UniqueChar
{
    char ch;
    int freqCount;
    int firstIndex;
};

class BSTNode
{
public:
    char ch;
    int freqCount;
    int firstIndex;
    BSTNode *left;
    BSTNode *right;
};

// Create a new BST node
BSTNode* Newnode()
{
    BSTNode* node = new BSTNode();
    node->left = node->right = nullptr;
    return node;
}

// Comparison function for sorting UniqueChar objects
bool comp(UniqueChar x, UniqueChar y)
{
    if (x.freqCount != y.freqCount)
        return x.freqCount > y.freqCount;
    else
        return x.firstIndex < y.firstIndex;
}

// Search for a node with a given character in a BST
BSTNode* bstSearch(BSTNode* root, char ch)
{
    if (root == nullptr || root->ch == ch)
        return root;
    if (ch < root->ch)
        return bstSearch(root->left, ch);
    return bstSearch(root->right, ch);
}

// Insert a new node into a BST
void bstInsert(BSTNode*& root, BSTNode* node)
{
    if (root == NULL)
    {
        root = node;
        return;
    }
    if (node->ch < root->ch)
        bstInsert(root->left, node);
    else
        bstInsert(root->right, node);
}

// Perform inorder traversal of a BST and store the nodes in a vector
void bstInorder(BSTNode* root, vector<UniqueChar>& freq)
{
    if (root == NULL)
        return;
    bstInorder(root->left, freq);
    UniqueChar uc;
    uc.ch = root->ch;
    uc.freqCount = root->freqCount;
    uc.firstIndex = root->firstIndex;
    freq.push_back(uc);
    bstInorder(root->right, freq);
}

vector<char> frequencySort(char S[], int n)
{ 
    BSTNode* root = NULL;
    BSTNode* temp = NULL;
    for(int i = 0; i < n; i = i + 1)
    {
        temp = bstSearch(root, S[i]);
        if (temp != NULL)
            temp->freqCount = temp->freqCount + 1;
        else if (temp == NULL)
        {
            BSTNode* node = Newnode();
            node->ch = S[i];
            node->firstIndex = i;
            node->freqCount = 1;
            node->left = NULL;
            node->right = NULL;
            bstInsert(root, node);
        }
    }
    
    vector<UniqueChar> freq;
    bstInorder(root, freq);
    sort(freq.begin(), freq.end(), comp);
    vector<char> result;
    for (int i = 0; i < freq.size(); i = i + 1) 
    {
        for (int j = 0; j < freq[i].freqCount; j = j + 1)
            result.push_back(freq[i].ch);
    }
    
    return result;
}

Solution analysis

Suppose there are m unique characters in the input string.

Time complexity = Time complexity of inserting each characters in a BST + Time complexity of extracting characters to freq[] list using inorder traversal+ Time complexity of sorting freq[] list + Time complexity of storing output in result[] = O(nlogm) + O(m) + O(m) + O(n) = O(nlogm)

Space complexity = Extra space for BST + Extra space for storing freq[] + Extra space for storing result[] + Extra space used by sorting = O(m) + O(m) + O(n) + O(1) = O(m + n)

Efficient solution using the hash table

Solution idea

Can we improve the time complexity further by using a hash table? Hashing is one of the efficient ways to store data and perform the searching operation in O(1) time average. The critical question is: Given a hash table of characters with their frequency, how can we perform a stable sorting on it? The idea is simple: We can use another hash table that stores the first occurrence of each unique character.

Solution steps

We create a hash table freqCount to store each unique character as a key and their frequency as a value.
We create another hash table firstIndex to store each unique character as a key and their first index in the input string as a value.
We linearly traverse input string for each character S[i]:
- If it exists in the freqCount, we increase its frequency count by 1.
- If it does not exist in the freqCount, we insert it with frequency count 1 and insert it in the hash table firstIndex as value.
Now we follow a similar procedure used in the previous approach:
- We create a freq[] list like the above approaches using freqCount and firstIndex Hash Tables.
- We sort the freq[] list based on the frequency count of characters while maintaining the stable order.
- We extract characters from the sorted freq[] list and store it in an extra output array result[n].
- Finally, we return the result[] array as an output.

Solution code C++

struct UniqueChar
{
    char ch;
    int freqCount;
    int firstIndex;
};

// Comparison function for sorting UniqueChar objects
bool comp(UniqueChar x, UniqueChar y)
{
    if (x.freqCount != y.freqCount)
        return x.freqCount > y.freqCount;
    else
        return x.firstIndex < y.firstIndex;
}

vector<char> sortByFrequency(char S[], int n)
{
    unordered_map<char, int> freqCount;
    unordered_map<char, int> firstIndex;
    for(int i = 0; i < n; i = i + 1)
    {
        if (freqCount.count(S[i]) > 0)
            freqCount[S[i]] = freqCount[S[i]] + 1;
        else
        {
            freqCount[S[i]] = 1;
            firstIndex[S[i]] = i;
        }
    }
    vector<UniqueChar> freq;
    for(auto& [ch, count] : freqCount)
    {
        UniqueChar temp;
        temp.ch = ch;
        temp.freqCount = count;
        temp.firstIndex = firstIndex[ch];
        freq.push_back(temp);
    }
    
    sort(freq.begin(), freq.end(), comp);
    vector<char> result;
    for (auto& uc : freq)
    {
        for(int j = 0; j < uc.freqCount; j = j + 1)
            result.push_back(uc.ch);
    }       
    return result;
}

Solution analysis

Suppose there are m unique characters in the input string.

Time complexity = Time complexity of storing freqCount and firstIndex hash tables + Time complexity of storing values in freq[] + Time complexity of sorting freq[] + Time complexity of storing output in result[] array = O(n) + O(n) + O(m) + O(n) = O(m + n)

Space complexity = Extra space of freqCount and firstIndex hash tables + Extra space of freq[] list + Extra space of result[] array + Extra space of sorting = 2*O(m) + O(m) + O(n) + O(1) = O(m + n)

Important note: We recommend transforming the above pseudo-codes into a favourite programming language (Java, Python, etc.) and verifying all the test cases. Enjoy programming!

Critical ideas to think!

How do we solve the above problem if there is no constraint to maintain a stable sorted order?
How do we implement the sorting algorithm used in the above approaches? Here we have only 52 unique characters are allowed in the input. So, how can we sort any such input in O(n) time complexity?
Can we solve this problem using some other approach?
How do we modify the above idea to sort characters in increasing order of frequency?
Can we try to solve this problem without using extra space?
Explore some problems where we use two hash tables for the solution.
Compare BST and Hash Table. In which scenario do we prefer BST over a hash table?
In the 2nd approach: why are we extracting elements from BST and storing them in freq[] list? Write a function to Implement inorderStore(root, freq) function.
Think about the other scenario where we could store extra details in BST nodes to solve the problem.

Sort String Based on Frequency of Characters

Let's understand the problem!

Example 1

Example 2

Discussed solution approaches

Solution approach 1: Brute force approach using nested loops

Solution idea

Solution steps

Solution code C++

Solution analysis

Solution approach 2: Using binary search tree

Solution idea

Solution steps

Solution code C++

Solution analysis

Efficient solution using the hash table

Solution idea

Solution steps

Solution code C++

Solution analysis

Critical ideas to think!

Suggested problems to solve

More from EnjoyAlgorithms

Self-paced Courses and Blogs