Difficulty: Medium, Asked-in: Google, Amazon, Uber, Hike
Key Takeaways
Given two array strings, X[] and Y[] of size m and n, write a program to find the length of longest common subsequence. Suppose m >= n.
Example 1
Input: X[] = [A, B, C, B, D, A, B], Y[]= [B, D, C, A, B, A]
Output: 4
Explanation: There are many common subsequences of X and Y. For example, the sequence [B, C, A] is a common subsequence of length 3 but it is not one of the longest common subsequence X and Y. Similarly, sequence [B, C, B, A], which is also common to both X and Y, has length 4. The sequence [B, C, B, A] is an LCS of X and Y, as is the sequence [B, D, A, B]. In simple words, X and Y have no common subsequence of length 5 or greater.
Example 2
Input: X[] = [E, B, T, B, C, A, D, F], Y[] = [A, B, B, C, D, G, F]
Output: 5
Explanation: The longest common subsequence is [B, B, C, D, F].
Solution idea of lcs using recursion
This is an optimization problem where solution space is huge i.e. there can be so many common subsequences of both strings, but we need to find a common subsequence with the longest length.
One idea would be to explore all the subsequences of both strings and find the longest among them. When we need to explore all possibilities, we can try to use the idea of recursion. The critical question would be: how do we explore all the subsequences recursively? Let's think!
Here we have one obvious choice at the start: are the last characters of both strings equal or not? If both are equal, that common character will be part of the longest subsequences.
Let's assume function lcs(X, Y, m, n) returns the length of the longest subsequence.
Case 1: If the last characters of strings are equal i.e. if(X[m-1] == Y[n-1]), then problem gets reduced to finding longest subsequence of the remaining substring of input size m - 1 and n - 1 i.e. lcs(X, Y, m, n) = 1 + lcs(X, Y, m - 1, n - 1)
Case 2: If the last characters of strings are not equal i.e. if (X[m-1] != Y[n-1]), there are two possibilities of smaller sub-problems, and we need to find the maximum of them:
Possibility 1: Find the longest subsequence length by excluding the last character of the string X and including the last character of the string Y i.e. lcs(X, Y, m - 1, n). Here the input size of X and Y is m - 1 and n.
Possibility 2: Find the longest subsequence length by including the last character of the string X and excluding the last character of the string Y i.e. lcs(X, Y, m, n - 1). Here the input size of X and Y is m and n - 1.
Now any of both possibilities can provide the length of the longest LCS. So we need to take the maximum of both possibilities i.e. lcs(X, Y, m, n) = max ( lcs(X, Y, m - 1, n), lcs(X, Y, m, n - 1) )
Base case: Recursion will terminate when any one of the string sizes gets reduced to 0 i.e. if (m == 0 || n == 0), return 0.
Solution pseudocode of lcs using recursion
int lcs(string X[], string Y[], int m, int n)
{
if (m == 0 || n == 0)
return 0
if (X[m - 1] == Y[n - 1])
return 1 + lcs(X, Y, m - 1, n - 1)
else
return max (lcs(X, Y, m, n - 1), lcs(X, Y, m - 1, n))
}
Time and space complexity analysis
We are exploring all possible subsequences of both strings using recursion and comparing the last character. To explore all possibilities, there would be two choices for each character in a string: Either it is part of the subsequence or not part of the subsequence.
Space complexity depends on the call stack size, which is equal to the height of recursion tree. Here input parameters (m and n) are at max decreasing by 1, so the height of recursion tree = max(m, n). Space complexity = O(max(m, n)). (Think!)
Now the critical question would be: why is time complexity exponential? The answer would be simple: we have overlapping sub-problems! In other words, we repeatedly solve the same sub-problems during the recursion. This can be better visualized by drawing the recursion tree.
In the following diagram of recursion tree, the subproblem of size (m - 2, n - 2) comes twice. We can observe other repeating subproblems if we draw recursion tree further.
Solution idea of lcs using memoization
The problem with the recursive solution is that the same subproblems get solved many times. A subproblem consists of two parameters, m and n, at-max decreasing by 1 during each recursive call. So there are (m+1)*(n+1) possible subproblems, and some of these subproblems must be solved over and over.
So one efficient idea would be to use the memoization approach of dynamic programming and store the solution of each sub-problems in an extra memory or lookup table. When we reencounter the same sub-problem during the recursion, we look into the extra memory and return already calculated solution directly instead of computing it again. This could help us to reduce a lot of computation and improve the time complexity significantly. Think!
Note: The dynamic programming solution checks whether we've already solved the sub-problem before. If yes, we look up the solution stored in the memory instead of recomputing it. Implemented most directly, we need to add some extra lines of code in our recursive algorithm. This "top-down" recursive version of dynamic programming is also known as "memoization".
Solution pseudocode of lcs using memoization
int lcsLength(string X[], string Y[], int m, int n)
{
int LCS[m + 1][n + 1]
for(int i = 0; i <= m; i = i + 1)
{
for (j = 0; j <= n; j = j + 1)
{
LCS[i][j] = -1
}
}
return lcs(X, Y, m, n, LCS)
}
int lcs(string X[], string Y[], int m, int n, int LCS[][])
{
if (m == 0 || n == 0)
return 0
if (LCS[m][n] < 0)
{
if (X[m - 1] == Y[n - 1])
LCS[m][n] = 1 + lcs(X, Y, m - 1, n - 1, LCS)
else
LCS[m][n] = max (lcs(X, Y, m, n - 1, LCS), lcs(X, Y, m - 1, n, LCS))
}
return LCS[m][n]
}
Time and space complexity analysis
In the worst case, we will be solving each subproblem only once and there are (m + 1)*(n + 1) different sub-problems, so time complexity = O(m*n). We are using a 2-D table of size (m+1)*(n+1) to store the solutions of the subproblems, so space complexity = O(m*n).
As usual, this is a worst-case analysis. The time might sometimes be better if not all array entries get filled out. For instance, if two strings match exactly, we only fill in diagonal entries and the algorithm will be fast. Think
Solution Idea and Steps
Since we have identified that it is a dynamic programming problem, we can solve it efficiently using the bottom-up approach. Our aim is to calculate the solution of the smaller problems in an iterative way and store their result in a table. But the critical question is : how do we build the solution in a bottom-up manner? Here are the important steps:
Iterative structure to fill the table: Now, we need to define the iterative structure to fill the table LCS[][] i.e. a relation by which we build the solution of the larger problem using the solution of smaller problems in a bottom-up manner. We can easily define the iterative structure by using the recursive structure of the recursive solution.
LCS[i][j] = 1 + LCS[i-1][j-1], if(X[i-1] == Y[j-1])
LCS[i][j] = max (LCS[i][j-1], LCS[i-1][j]), if(X[i-1] != Y[j-1])
Solution Pseudocode
int lcs(string X[], string Y[], int m, int n)
{
int LCS[m + 1][n + 1]
for(int i = 0; i <= m; i = i + 1)
LCS[i][0] = 0
for(int j = 0; j <= n; j = j + 1)
LCS[0][j] = 0
for(int i = 1; i <= m; i = i + 1)
{
for(int j = 1; j <= n; j = j + 1)
{
if(X[i - 1] == Y[j - 1])
LCS[i][j] = 1 + LCS[i - 1][j - 1]
else
LCS[i][j] = max (LCS[i - 1][j], LCS[i][j - 1])
}
}
return LCS[m][n]
}
Time and space complexity analysis
Time complexity = Time complexity of initializing the table + Time complexity of filling the table into a bottom-up manner + time complexity of returning the final solution = O(m + n) + O(mn) + O(1) = O(mn)
Space complexity = O(mn) for storing the table size (m+1)*(n+1).
If we observe the previous 2-D solution of the problem, we are only using the adjacent indexes in the table to build the solution in a bottom-up manner. In other words, we are using LCS[i-1][j-1], LCS[i][j-1] and LCS[i-1][j] to fill the position LCS[i][j]. So there are two basic observations:
So no need to store all rows in the DP matrix and we can just store two rows at a time. In other words, we can use a two-row 2D array LCS[2][n+1] to store values and get the output.
Solution Pseudocode
int lcs(string X[], string Y[], int m, int n)
{
int LCS[2][n + 1]
for (int i = 0 ; i <= 1; i = i + 1)
LCS[i][0] = 0
for (int j = 0; j <= n; j = j + 1)
LCS[0][j] = 0
for (int i = 1; i <= m; i = i + 1)
{
for (int j = 1; j <= n; j = j + 1)
{
if (X[i - 1] == Y[j - 1])
LCS[i % 2][j] = 1 + LCS[(i-1) % 2][j - 1]
else
LCS[i % 2][j] = max(LCS[(i - 1) % 2][j], LCS[i % 2][j - 1])
}
}
return LCS[m % 2][n]
}
Time and space complexity analysis
There is no change in time complexity in comparison to the previous approach (as we are using two nested loops similar to the previous approach). So time complexity = O(mn). Space complexity gets reduced to O(n) because the total number of rows is constant.
Solution Idea and Steps
Now the critical question would be: can we optimize the space complexity further? The idea is: In the last approach, we are using a two-row 2-D table and storing values in a row-wise fashion using two values from the previous row. So can we store one row of data and track two previous rows values using two variables? Let's think!
Now we run nested loops similar to the last approach. Here at the ith iteration of the outer loop, we store ith row values in table LCS[n] using the inner loop. Suppose after the (i-1)th iteration of the outer loop, we have all the values of (i-1)th row stored in the table LCS[n]. Now we try to store the values of the ith row using the inner loop.
We highly recommend visualizing this approach on paper by drawing the flow of storing values in the table by updating variables. Think!
Solution Pseudocode
int lcs(string X[], string Y[], int m, int n)
{
int LCS[n + 1]
LCS[0] = 0
for(int i = 1; i <= m; i = i + 1)
{
int prevRow = 0, prevRowPrevCol = 0
for(int j = 1; j <= n; j = j + 1)
{
prevRowPrevCol = prevRow
prevRow = LCS[j]
if(X[i-1] == Y[j-1])
LCS[j] = prevRowPrevCol + 1
else
LCS[j] = max(LCS[j-1], prevRow)
}
}
return LCS[n]
}
Time and space complexity analysis
There is no change in time complexity as we use two nested loops similar to the previous approach. So time complexity = O(mn). Space complexity is still O(n), but it is the optimized version of the previous approach because we only use a 1-D table to store one row.
Enjoy learning, Enjoy problem-solving!
Subscribe to get weekly content on data structure and algorithms, machine learning, system design and oops.