Here I am just trying to explain KMP algorithm in plain english. I will also explain worst time complexity & why it is O(m + n).
We will take two examples, one with no repeatable character in the pattern & another with repeatable characters in the pattern.
Pattern With Repeatable Characters:
Text: aaaabaabab
Pattern: aaaaa
Longest Prefix Suffix Table:
First, we need to build a proper prefix suffix table. Let me explain the basic idea. We will take each substring of the pattern.
Here these are “a”,”aa”, “aaa”, “aaaa”. We are skipping “aaaaa” as that is the whole pattern. If we get a match for that, we are done.
For each of this substring, we will check if there are some consecutive characters which are present in the beginning & also at the end.
So why do we need that? Idea is simple. Suppose, we matched “aaaa”, then we had a mismatch. We know “aaa” is common in the beginning & at the end.
As we matched last three “a”, we know that starting three “a” are already matched. So we can skip them when we start matching again & directly go to 4th character. Let’s see below.
Here is the prefix suffix table for the example above. It’s easy to understand. All characters are same. So proper prefix suffix length will be substring length – 1.
Length: 1 2 3 4 Value: 0 1 2 3
Search Steps:
Now let’s match it.
t: aaaabaabab
p: aaaaa
t[1] = p[1], match length 1, increment both t & p.
t: aaaabaabab
p: aaaaa
t[2] = p[2], match length 2, increment both t & p.
t: aaaabaabab
p: aaaaa
t[3] = p[3], match length 3, increment both t & p.
t: aaaabaabab
p: aaaaa
t[4] = p[4], match length 4, increment both t & p.
t: aaaabaabab
p: aaaaa
t[5] ≠ p[5], match failed, we won’t move t in that case, prefix[4] = 3 from the table, so we set p to prefix[4] + 1 = 4.
t: aaaabaabab
p: aaaaa
t[5] ≠ p[4], match failed, we won’t move t in that case, prefix[3] = 2 from the table, so we set p to prefix[3] + 1 = 3.
t: aaaabaabab
p: aaaaa
t[5] ≠ p[3], match failed, we won’t move t in that case, prefix[2] = 1 from the table, so we set p to prefix[2] + 1 = 2.
t: aaaabaabab
p: aaaaa
t[5] ≠ p[2], match failed, we won’t move t in that case, prefix[1] = 0 from the table, as we set p to prefix[1] + 1 = 1.
t: aaaabaabab
p: aaaaa
t[5] ≠ p[1], match failed, p is at the first character, so we can’t decrement it & increment only t.
t: aaaabaabab
p: aaaaa
t[6] = p[1], match length 1, increment both t & p.
t: aaaabaabab
p: aaaaa
t[7] = p[2], match length 2, increment both t & p.
t: aaaabaabab
p: aaaaa
t[8] ≠ p[3], match failed, we won’t move t in that case, prefix[2] = 1 from the table, so we set p to prefix[2] + 1 = 2.
t: aaaabaabab
p: aaaaa
t[8] ≠ p[2], match failed, we won’t move t in that case, prefix[1] = 0 from the table, so we set p to prefix[1] + 1 = 1.
t: aaaabaabab
p: aaaaa
t[8] ≠ p[1], match failed, p is at the first character, so we can’t decrement it & increment only t.
t: aaaabaabab
p: aaaaa
t[9] = p[1], match length 1, increment both t & p.
t: aaaabaabab
p: aaaaa
t[10] ≠ p[2], match failed, we won’t move t in that case, prefix[1] = 0, from the table, so we set p to prefix[1] + 1 = 1.
t: aaaabaabab
p: aaaaa
t[10] ≠ p[1], match failed, p is at the first character, so we can’t decrement it & increment only t, but t is at the last position, so matching ends here without finding the pattern.
As you can see, t is never decremented. So if text length is n, we have n traversal there & not more than that.
But p gets decremented multiple times for same position of t. So if pattern length is m, is worst case time complexity O(mn)?
No. If you look at the above steps carefully, p was incremented 4 times & then only it got decremented 4 times consecutively. Again p was incremented 2 times & then it was decremented 2 times. At last p was incremented 1 time & then got decremented 1 time.
So basically to have n negative steps (non-matching steps), you need n positive steps (matching steps). And n positive steps means you are done traversing through the whole text.
So worst time complexity is n + n = 2n which is O(n).
The time complexity of building proper prefix suffix table is O(m). Probably I will write a separate blog post on that. Here I want to focus on the actual search algorithm of KMP.
So total time complexity of KMP algorithm is O(m + n).
In our example above, we took a pattern with all repeatable characters. Now let’s take an example where all characters are different.
Pattern Without Repeatable Characters:
Text: aaaabaabab Pattern: abcde
Longest Prefix Suffix Table:
Prefix suffix table is simple. There is no match among the characters in the pattern. So all values are 0.
Length: 1 2 3 4 Value: 0 0 0 0
Search Steps:
Now let’s match it.
t: aaaabaabab p: abcde
t[1] = p[1], match length is 1, increment both t & p.
t: aaaabaabab p: abcde
t[2] ≠ p[2], match failed, we won’t increment t in that case, prefix[1] = 0 from the table, so we set p to prefix[1] + 1 = 1.
t: aaaabaabab p: abcde
t[2] = p[1], match length 1, increment both t & p.
t: aaaabaabab p: abcde
t[3] ≠ p[2], match failed, we won’t increment t in that case, prefix[1] = 0 from the table, so we set p to prefix[1] + 1 = 1.
t: aaaabaabab p: abcde
t[3] = p[1], match length 1, increment both t & p.
t: aaaabaabab p: abcde
t[4] ≠ p[2], match failed, we won’t increment t in that case, prefix[1] = 0 from the table, so we set p to prefix[1] + 1 = 1.
t: aaaabaabab p: abcde
t[4] = p[1], match length 1, increment both t & p.
t: aaaabaabab p: abcde
t[5] = p[2], match length 2, increment both t & p.
t: aaaabaabab p: abcde
t[6] ≠ p[3], match failed, we won’t increment t in that case, prefix[2] = 0 from the table, so we set p to prefix[2] + 1 = 1.
t: aaaabaabab p: abcde
t[6] = p[1], match length 1, increment both t & p.
t: aaaabaabab p: abcde
t[7] ≠ p[2], match failed, we won’t increment t in that case, prefix[1] = 0 from the table, so we set p to prefix[1] + 1 = 1.
t: aaaabaabab p: abcde
t[7] = p[1], match length 1, increment both t & p.
t: aaaabaabab p: abcde
t[8] = p[2], match length 2, increment both t & p.
t: aaaabaabab p: abcde
t[9] ≠ p[3], match failed, we won’t increment t in that case, prefix[2] = 0 from the table, so we set p to prefix[2] + 1 = 1.
t: aaaabaabab p: abcde
t[9] = p[1], match length 1, increment both t & p.
t: aaaabaabab p: abcde
t[10] = p[2], match length 2, increment both t & p, but t is at the last position, so matching ends here without finding the pattern.
If you notice carefully above, you can see the same thing. To decrement p, we need to match pattern characters at least same number of times beforehand.
So search time complexity is 2n which translates into O(n).
If we take prefix suffix table construction time into account, then total time complexity would be O(m+n).