In this post, I will try to explain Rabin-Karp algorithm & its worst time complexity. Rabin-Karp algorithm is based on hash matching.
Let’s take an example.
Text: aadabc
Pattern: abc
Let’s create a table as below & represent each letter with some arbitrary number. We will use it to calculate hash.
a b c d e
1 2 3 4 5
So how do we calculate the hash value? Our hash function is simple. We will add up the number values as per the table.
Pattern hash: a + b + c = 1 + 2 + 3 = 6
For comparing the text, we will take first 3 characters as our pattern length is 3. Then we will shift one letter in each step & recalculate hash.
Text hash: t[1] + t[2] + t[3] = a + a + d
= 1 + 1 + 4
= 6
So hash matched. We need to compare each character from 1 to 3 position with the characters in the pattern.
t: aadabc
p: abc
t[1] = p[1], match length 1, increment both t & p.
t: aadabc
p: abc
t[2] ≠ p[2], matching failed. So t[1] to t[3] is not matching with the pattern. Now we will match pattern against text characters from t[2] to t[4].
Text hash: t[2] + t[3] + t[4] = a + d + a
= 1 + 4 + 1
= 6
Patten hash & text hash matched again. We need to compare each text character from 2 to 4 position with the characters in the pattern.
t: aadabc
p: abc
t[2] = p[1], match length 1, increment both t & p.
t: aadabc
p: abc
t[3] ≠ p[2], matching failed. So t[2] to t[4] is not matching with the pattern. We need to match pattern against t[3] to t[5].
Text hash: t[3] + t[4] + t[5] = d + a + b
= 4 + 1 + 2
= 7
Hash matching failed. We will move forward & match pattern against t[4] to t[6].
Text hash: t[4] + t[5] + t[6] = a + b + c
= 1 + 2 + 3
= 6
So hash matched. We need to compare each character from 4 to 6 position with the characters in the pattern.
t: aadabc
p: abc
t[4] = p[1], match length 1, increment both t & p.
t: aadabc
p: abc
t[5] = p[2], match length 2, increment both t & p.
t: aadabc
p: abc
t[6] = p[3], match length 3, so we found our match.
Now let’s discuss few drawbacks of above example. When we are shifting one position in the text, we are recalculating the entire hash.
If we have pattern length 20, we will add up 20 values in each text hash calculation step. If we have pattern length 10, we will add 10 values in each text hash calculation step. So it’s not constant & time complexity of hash function depends on pattern length m.
If we use normal hash function, worst time complexity will be O(mn). That’s why we need to use concept of rolling hash function. It uses sliding window technique.
When we shift one position, we will remove first position value from previous hash & add current position value to that. So whatever the value of m is, we can recalculate text hash with only two operations in each step.
That is constant & that will bring down time complexity to O(n).
Here is how we can do that.
Text hash: t[1] + t[2] + t[3] = a + a + d = 1 + 1 + 4 = 6
To get hash for t[2] to t[4], we can do following.
t[1] + t[2] + t[3] = 6
6 - t[1] = 5
5 + t[4] = 6
So, t[2] + t[3] + t[4] = 6
There is one more thing. Whenever there is a hash match, we need to match pattern with that portion of text & verify if characters are same. In the example above, we saw that hash value is same but characters are not.
These false positives will increase time complexity. And if you are using a bad hash function, then you might get false positives in every step. You might need to compare half of the pattern characters on an average before you find a mismatch.
So time complexity on that case becomes (m / 2) * n or O(mn). That’s why Rabin-Karp algorithm has a worst time complexity of O(mn).
We can do some improvement on the hash function that can reduce time complexity to O(m+n) in most of the cases. I will try to write it in some future post.
But I hope you have got a clear understanding of this string searching algorithm & why its worst time complexity is O(mn).