Sort Characters by Frequency

EnjoyAlgorithms Blog Cover Image

Difficulty: Medium, Asked-in: Google, Microsoft, Bloomberg, Paypal

Key takeaway: an excellent problem to learn problem-solving using sorting and data structures like BST and hash table! We are storing additional information to sort the string in a stable order.

Let's understand the problem!

Given a string S, write a program to sort it in decreasing order based on the frequency of the characters.

  • The frequency of a character is the number of times it appears in the string.
  • In the output, characters with higher frequency come first.
  • If two characters have the same frequency, whichever occurs earliest in S, must come first. In other words, the sorting must be stable.
  • Assume string consists of uppercase and lowercase English letters. We need to consider uppercase and lowercase of the same character as a different character.

Important note: before moving on to the solutions, we recommend trying this problem on paper for atleast 15 or 30 minutes. Enjoy problem-solving!

Example 1

Input: s = "tree", Output: "eetr"

Explanation: 'e' appears twice while 'r' and 't' both appear once. So 'e' must appear before both 'r' and 't'.

Example 2

Input: s = "ccaaaAA", Output: "aaaccAA"

Explanation: 'a' appears three times while 'c' and 'A' appear twice. So 'a' must appear before both 'c' and 'A'. Similarly, the frequency of 'c' and 'A' are the same. So to maintain the stable order 'c' must appear before 'A'.

Discussed solution approaches

  • A brute force approach using nested loops
  • A solution approach using a binary search tree
  • An effcient solution using the hash table

Solution approach 1: A brute force approach using nested loops

Solution idea

To sort characters based on their frequency:

  • We should identify each unique character in the input string.
  • We also need to know the frequency of each unique character.
  • Sorting must be stable, so we need to design a method to distinguish the position of characters with the same frequency in sorted order. In such a situation, the idea is simple: a character that occurs first in the input string must come first in the sorted output.

So how do we ensure the above things? One basic idea would be:

  • We create an extra memory and store frequency count and the first occurrence of each unique character of the input string.
  • Now we sort this extra memory based on frequency count and first occurrence. We are storing the first occurrence for a simple reason: it will help us to maintain a stable sorting. In other words, if two characters have the same frequency then a character with the lower first index will come first in the sorted output.
  • Now we traverse the above sorted list and extract all the charcaters based on their frequency count in an extra memory of size n.

Solution steps

  1. We define a class called or structure UniqueChar, which stores three information related to each unique character: the character value, frequency count, and first occurrence in the input string.

    class UniqueChar
    {
       char ch
       int freqCount
       int firstIndex
    }
    
  2. Now we declare an additional list freq[], where each element stores the object type UniqueChar i.e. vector freq. We are taking a list not an array because we don't know the number of unique characters in the input string. But one idea would be clear: the size of this list will be equal to the total number of unique characters.
  3. Now we run a nested loop to identify each unique character and store their frequency count and first occurrence in the freq[] list. Here outer loop iterates through each character, and the inner loop checks its unique occurrence in the freq[] list.

    • For each character S[i], we check if it already exists in the freq[] list or not. If it does, we increase its frequency count by 1.
    • Otherwise, it will be the first occurrence of S[i]. So we insert S[i] in the freq[] list with the frequency count as 1 and index i as a first occurrence.
    • After the nested loops, freq[] stores all the unique characters with additional information.
    for (int i = 0; i < n; i = i + 1)
    {
        bool found = false
        for (int j = 0; j < freq.length; j = j + 1)
        {
            if (S[i] == freq[j].ch)
            {
                freq[j].freqCount = freq[j].freqCount + 1
                found = true
                break
            }
        }
            
        if (found == false)
        {
            UniqueChar temp
            temp.ch = S[i]
            temp.freqCount = 1
            temp.firstIndex = i
            freq.append(temp)
        }
    }
    
  4. Now we sort the freq[] list based on the frequency count of characters. If the frequency count is equal for two elements, we should first process the element with lower firstIndex. Our sorting procedure can use a comparator to ensure this.

    sort(freq, freq.length, comp)
    bool comp(UniqueChar x, UniqueChar y)
    {
       if (x.freqCount != y.freqCount)
           return x.freqCount > y.freqCount
       else
           return x.firstIndex < y.firstIndex
    }
    
  5. After this, we extract characters from the sorted freq[] list and store it in an extra output array result[n].

    char result[n]
    int k = 0
    for (int i = 0; i < freq.length; i = i + 1)
    {
       for (int j = 0; j < freq[i].freqCount; j = j + 1)
       {
           result[k] = freq[i].ch
           k = k + 1
       }    
    } 
    
  6. Finally, we return the result[] array as an output.

Solution pseudocode C++

int[] frequencySort(char S[], int n)
{
    vector<UniqueChar> freq
    for (int i = 0; i < n; i = i + 1)
    {
        bool found = false
        for (int j = 0; j < freq.length; j = j + 1)
        {
            if (S[i] == freq[j].ch)
            {
                freq[j].freqCount = freq[j].freqCount + 1
                found = true
                break
            }
        }
        
        if (found == false)
        {
            UniqueChar temp
            temp.ch = S[i]
            temp.freqCount = 1
            temp.firstIndex = i
            freq.append(temp)
        }
    }
  
    sort(freq, freq.length, comp)
    
    char result[n]
    int k = 0
    for (int i = 0; i < freq.length; i = i + 1)
    {
        for (int j = 0; j < freq[i].freqCount; j = j + 1)
        {
            result[k] = freq[i].ch
            k = k + 1
        }    
    }       
    return result
}

Solution analysis

Suppose there are m unique characters in the input string. Time complexity = Time complexity of storing values in freq[] list + Time complexity of sorting freq[] list + Time complexity of storing output in result[] = O(mn) + O(m) + O(n) = O(mn)

Why time complexity of sorting freq[] list is O(m)? The idea would be simple: if only uppercase and lowercase English letters are part of the input, then there could be only 52 possible unique characters repeated several times. So we can use the idea of bucket sort to sort the input in O(m) time! In other words, there is no need to use comparison-based sorting algorithms like merge sort or heap sort! Think!

Space complexity = Extra space of freq[] list + Extra space used by sorting + Storing result[] array = O(m) + O(1) + O(n) = O(m + n)

Solution approach 2: Using binary search tree

In the above approach, the nested loops dominate the time complexity, which is O(nm). If we observe the loop: we are searching the unique presence of each character in the freq[] list and inserting it with additional information. So one idea is clear: searching is a critical operation that takes O(m) time for each character in the worst case. The critical question is: can we use some other effcient data structure to improve the time complexity of searching?

A binary search tree can be a useful data structure for frequent search queries, where the time complexity depends on the tree's height. Although, if we use the idea of a balanced binary search tree such as AVL Tree or Red-Black tree, the worst-case time complexity for searching, insertion, and deletion would be O(logn).

We can use the following node structure to store each unique character in a BST with two additional parameters: freqCount to store the frequency count and firstIndex for string the first occurrence.

Note: we will be assuming a balanced BST for the implementation.

class BSTNode
{
    char ch
    int freqCount
    int firstIndex
    node *left
    node *right
}

Solution steps

  1. We create a BST by inserting each unique character from the input string. Before inserting a character S[i] in BST, we check if it already exists or not. If it does, we increase the frequency count by 1. Else, we insert a new node with frequency count 1 and i as a first occurrence.
  2. Now we create an extra object list freq[] (like the above solution) and perform an in-order traversal of BST to store unique characters with other details in freq[].
  3. Now we follow the procedure of the previous approach:
  4. We sort the freq[] list based on the frequency count of characters while maintaining the stable order.
  5. We extract characters from the sorted freq[] list and store it in an extra output array result[n].
  6. Finally, we return the result[] array as an output.

Solution pseudocode

class UniqueChar
{
    char ch
    int freqCount
    int firstIndex
}

int[] frequencySort(char S[], int n)
{ 
    BSTNode root = NULL
    BSTNode temp = NULL
    for(int i = 0; i < n; i = i + 1)
    {
        temp = search(root, S[i])  
        if (temp != NULL)
            temp->freqCount = temp->freqCount + 1
        else if (temp == NULL)
        {
            BSTNode node = Newnode()
            node->data = S[i]
            node->firstIndex = i
            node->freqCount = 1
            node->left = NULL
            node->right = NULL
            insert(root, node)  
        }
    }
    
    vector<UniqueChar> freq
    inorderStore(root, freq)
    sort(freq, freq.length, comp)
    char result[n]
    int k = 0
    for (int i = 0; i < freq.length; i = i + 1)
    {
        for(int j = 0; j < freq[i].freqCount; j = j + 1)
        {
            result[k] = freq[i].ch
            k = k + 1
        }    
    }       
    return result
}

bool comp(UniqueChar x, UniqueChar y)
{
    if (x.freqCount != y.freqCount)
        return x.freqCount > y.freqCount
    else
        return x.firstIndex < y.firstIndex
}

Solution analysis

Suppose there are m unique characters in the input string.

Time complexity = Time complexity of inserting each characters in a BST + Time complexity of extracting characters to freq[] list + Time complexity of sorting freq[] list + Time complexity of storing output in result[] = O(nlogm) + O(m) + O(m) + O(n) = O(nlogm)

Space complexity = Extra space for BST + Extra space for storing freq[] + Extra space for storing result[] + Extra space used by sorting = O(m) + O(m) + O(n) = O(m + n)

An effcient solution using the hash table

Solution idea

Can we improve the time complexity further by using a hash table? Hashing is one of the effcient ways to store data and perform the searching operation in O(1) time average. But a critical question is: given a hash table of characters with their frequency, how can we perform a stable sorting on it? The idea is simple: we can use another hash table that stores the first occurrence of each unique character.

Solution steps

  1. We create a hash table freqCount to store each unique character as a key and their frequency as a value.
  2. We create another hash table firstIndex to store each unique character as a key and their first index in the input string as a value.
  3. We linearly traverse the input string for each character S[i]:

    • If it exists in the freqCount, we increase its frequency count by 1.
    • If it does not exist in the freqCount, we insert it with frequency count 1 and insert it in the hash table firstIndex as value.
  4. Now we follow a similar procedure used in the previous approach:

    • We create a freq[] list like the above approaches using freqCount and firstIndex Hash Tables.
    • We sort the freq[] list based on the frequency count of characters while maintaining the stable order.
    • We extract characters from the sorted freq[] list and store it in an extra output array result[n].
    • Extract characters to result[] array according to their frequency and return it.

Solution pseudocode

class UniqueChar
{
    char ch
    int freqCount
    int firstIndex
}

int[] sortByFrequency(char S[], int n)
{
    Hash Table freqCount 
    Hash Table firstIndex
    for(int i = 0; i < n; i = i + 1)
    {
        if (freqCount.search(S[i]) == true)
            freqCount[S[i]] = freqCount[S[i]] + 1
        else
        {
            freqCount[S[i]] = 1
            firstIndex[S[i]] = i
        }
    }
    vector<UniqueChar> freq
    for(each charcater x in freqCount)
    {
        UniqueChar temp
        temp.ch = x
        temp.freqCount = freqCount[x]
        temp.firstIndex = firstIndex[x]
        freq.append(temp)
    }
    
    sort(freq, freq.length, comp)
    char result[n]
    int k = 0
    for (int i = 0; i < freq.length; i = i + 1)
    {
        for(int j = 0; j < freq[i].freqCount; j = j + 1)
        {
            result[k] = freq[i].ch
            k = k + 1
        }    
    }       
    return result
}

bool comp(UniqueChar x, UniqueChar y)
{
    if (x.freqCount != y.freqCount)
        return x.freqCount > y.freqCount
    else
        return x.firstIndex < y.firstIndex
}

Solution analysis

Suppose there are m unique characters in the input string.

Time complexity = Time complexity of seaching and insertion in freqCount and firstIndex hash tables + Time complexity of storing values in freq[] + Time complexity of sorting freq[] + Time complexity of storing output in result[] array = O(n) + O(n) + O(m) + O(n) = O(m + n)

Space complexity = Extra space of freqCount and firstIndex hash tables + Extra space of freq[] list + Extra space of result[] array + Extra space of sorting = 2*O(m) + O(m) + O(n) + O(1) = O(m + n)

Important note: we recommend transforming the above pseudo-codes into a favorite programming language (C, C++, Java, Python, etc.) and verifying all the test cases. Enjoy programming!

Critical ideas to think!

  • How do we solve the above problem if there is no constraint to maintain a stable sorted order?
  • How do we implement the sorting algorithm used in the above approaches? Here we have only 52 unique characters are allowed in the input. So, how can we sort any such input in O(n) time complexity?
  • Can we solve this problem using some other approach?
  • How do we modify the above idea to sort characters in increasing order of frequency?
  • Can we try to solve this problem without using extra space?
  • Explore some problems where we use two hash tables for the solution.
  • Compare the BST and Hash Table. In which scenario do we prefer BST over a hash table?
  • In the 2nd approach: why are we extracting elements from BST and storing them in freq[] list? Write a function to Implement inorderStore(root, freq) function.
  • Think about the other scenario where we could store extra details in BST nodes to solve the problem.

Suggested problems to solve

  • Perform minimum deletion operations such that all elements have a unique frequency
  • Find the most frequent element in an array
  • Find the sum of all elements with an even frequency
  • Sort the array except for elements with frequency k

Please write in the message below if you find anything incorrect, or you want to share more insight, or you know some different approaches to solve this problem. Enjoy learning, Enjoy algorithms!

We'd love to hear from you

More content from EnjoyAlgorithms

Our weekly newsletter

Subscribe to get free weekly content on data structure and algorithms, machine learning, system design, oops and math. enjoy learning!