We will be solving here a LeetCode problem. The problem is to find a pattern & all its anagrams in a String. You will find similar problem in InterviewBit or GeeksforGeeks too.
Problem:
Given two strings s and p, return an array of all the start indices of p’s anagrams in s. You may return the answer in any order.
String can contain only lowercase english letters. The length of p is m. The length of s is n.
Input:
Text: cedaabd
Pattern: abad
Let’s consider the example above. I have listed all the anagrams of pattern abad:
aabd, aadb, abad, abda, adab, adba, baad, bada, bdaa, daab, daba, dbaa
So in total we have 12 patterns. There are two problems here if we want to solve this efficiently without brute force.
Problem 1:
Let’s say we have only one pattern to match instead of 12. As you can see from the pseudocode below, we have a nested loop. So worst time complexity would be O(m * n). Worst time complexity is not linear.
for i in 0 to n-1
for j in 0 to m-1
if s[i+j] != p[j]
break;
end for
found a result
end for
Problem 2:
Now we would have to do the same process for all the anagrams. We would have to multiply the complexity with number of anagrams lets say x. So worst time complexity would eventually become O(m * n * x).
How do we solve it then? Can we find a proper linear algorithm to solve this problem? Yes, we can. We will use two things to make it work. Let’s understand the concept.
Point 1:
Can we find a way where we don’t have to match each anagram separately? For that, we can use a table where we would store count of each unique character in pattern p. All anagrams will have same count for unique characters. Anagrams are nothing but rearrangements of pattern p after all. How does this table help?
initialize pc // count table for p
for i in 0 to m-1
increment p[i] character count in pc
end for
for i in 0 to n-1
initialize sc // count table for s
for j in 0 to m-1
increment s[i+j] character count in sc
end for
if sc = pc
found a result
end for
Basically we will start a loop from first character in s & compute a count table with next m number of characters in s. If count tables for s & p have same count for each character, we are good. We found a result. We know that this substring of s will match an anagram of p because character count is same. We won’t know which exact anagram though. But that is not the requirement for the solution. This way we can avoid doing the same process again & again for each anagram. Above algorithm will take care of all possible anagrams. So we brought down worst time complexity from O(m * n * x) to O(m * n). But it still is non-linear.
Point 2:
The worst time complexity is still O(m * n). That is because above pseudocode has nested loop. To make time complexity linear, we need to get rid of the nested loop. For that we will be using sliding window technique.
initialize pc // count table for p
for i in 0 to m-1
increment p[i] character count in pc
end for
initialize sc // count table for s
for i in 0 to m-1
increment s[i] character count in sc
end for
if sc = pc
found a result
for i in m to n-1
decrement s[i-m] character in sc // first character in previous window
increment s[i] character count in sc // last character in current window
if sc = pc
found a result
end for
If you follow the pattern, basically we are shifting only one character in s in each step of the loop. So we don’t need to recalculate whole count table for current substring of s in each step. In first step, we are calculating count table based on 0 to m-1 indexed characters in s. In second step, we are calculating count table from 1 to m indexed characters in s. If you compare first & second steps, characters 1 to m-1 remain same in both steps. To convert count table from first step to second step, we just need to decrement the count of first character from the first step. Then increment count of last character from the second step. That’s all.
We don’t have any nested loop now. We looped through p. Time complexity for that is O(m). We also looped through s separately. Time complexity for that is O(n). So worst time complexity becomes O(m + n) which is linear.
But you might ask what about the comparison of count tables that you are doing in each step while you are looping through s. As described in the problem, only lowercase english letters are allowed. So we can have an array of 26 elements as a count table. In each step, we will be comparing 26 characters from both count tables. The size of count table doesn’t depend on m or n. So it takes constant time. You can omit constant time from time complexity calculation. So O(c * n) becomes O(n) where c is constant. Ultimately the worst time complexity of the above solution becomes O(m + n).
Here is the working Java solution for the LeetCode problem based on above approach:
class Solution {
public List<Integer> findAnagrams(String s, String p) {
List<Integer> result = new ArrayList<Integer>();
int pLen = p.length();
if(s.length() < pLen){
return result;
}
int[] pCount = new int[26];
for(int i = 0; i < pLen; i++){
pCount[p.charAt(i) - 'a']++;
}
int[] sCount = new int[26];
for(int i = 0; i < pLen; i++){
sCount[s.charAt(i) - 'a']++;
}
if(isMatching(pCount, sCount)){
result.add(0);
}
for(int i = pLen; i < s.length(); i++){
sCount[s.charAt(i - pLen) - 'a']--;
sCount[s.charAt(i) - 'a']++;
if(isMatching(pCount, sCount)){
result.add(i - pLen + 1);
}
}
return result;
}
private boolean isMatching(int[] pCount, int[] sCount){
for(int i = 0; i < pCount.length; i++){
if(pCount[i] != sCount[i]){
return false;
}
}
return true;
}
}