Solving the String Search problem in Python
By John Lekberg on November 15, 2020.
This week's post is about solving the "String Search" problem. String-searching algorithms have applications in a wide range of fields, including digital forensics and spam detection.
You will learn:
- How to create a brute force solution.
- How to improve the time complexity of the brute force solution by using the Knuth-Morris-Pratt algorithm.
- How to create a proxy object that allows you to use 1-based indexing (used by the Knuth-Morris-Pratt algorithm).
Problem statement
The goal of the "String Search" problem is to find the first index (using 0-based indexing) of a substring, pattern, in a larger string, text. E.g.
Find the first index of "o" in "hello world".
As input, you are given two strings, text and pattern.
As output, you must return the first index of pattern inside text. If pattern is not a substring of text, then return nothing.
E.g., here's a scenario with a match:
- Text is "hello world".
- Pattern is "o".
- The first index of pattern inside text is 4.
E.g., here's a scenario without a match:
- Text is "hello world".
- Pattern is "z".
- Pattern does not occur inside text.
In production code, please use the built-in method str.index instead of
creating your own string-searching algorithm. However, there are only
built-in methods for str objects and bytes objects: if you want to, e.g., find the first index of [2, 3]
within [1, 2, 3, 4, 5]
, then you'll need to use other code (write your own or use a library).
How I represent the data in the problem
For the input, I represent text and pattern as Sequence objects. In Python, a Sequence object can be a str object, a list object, a bytes object, a tuple object, etc. It represents a general "sequence" type. E.g. here are some Sequence objects:
"hello world"
(1, 2, 3, 4, 5)
["A♭maj⁷", "gm", "f♯o", "fm⁷", "E⁷", "E♭", "D⁷", "G⁷sus4", "G⁷"]
For the output, I represent the index as an int object (e.g. 7
). If there is no
match, then I use the special constant None.
Creating a brute force solution
A brute force solution to the "String Search" problem can iterate over every index in text and see if pattern matches at that index. If the match fails at a particular index, then I move onto the next index.
Here's Python code that does this:
def string_search_bf(*, text, pattern): """Find the first index of `pattern` in `text` using brute force. If nothing is found, then return None. text -- Sequence. E.g. "hello" or [1, 2, 3, 4]. pattern -- Sequence. E.g. "ll" or [2, 3]. """ n, m = len(text), len(pattern) for i in range(1 + (n - m)): match = True for j in range(m): if text[i + j] != pattern[j]: match = False break if match: return i string_search_bf(text="hello world", pattern="o")
4
What's the time complexity of this code? For n-element text and m-element pattern:
- I iterate over
i
O(n-m) times.- Within each iteration, I iterate over
j
O(m) times.
- Within each iteration, I iterate over
As a result, the overall time complexity of this algorithm is
O((n-m)m) time
Creating a faster solution using the Knuth-Morris-Pratt algorithm
For n-element text and m-element pattern, the Knuth-Morris-Pratt algorithm solves the "String Search" problem in
O(n+m) time
This is more efficient than the time complexity of the brute force algorithm,
O((n-m)m) time
The key idea of the Knuth-Morris-Pratt algorithm is to make use of previous partial matches. In the brute force algorithm, when a match fails, the algorithm backtracks text to the next index. But in the Knuth-Morris-Pratt algorithm, when a match fails, the algorithm doesn't backtrack text. Instead, the algorithm determines how to backtrack pattern to find the current partial match.
For more information on the Knuth-Morris-Pratt algorithm, see
- Knuth, Donald, Morris, James, Pratt, Vaughan. 1977. "Fast Pattern Matching in Strings." SIAM J. Comput., 6(2), 323–350. (28 pages) https://doi.org/10.1137/0206024
The algorithm outlined in this paper uses 1-based indexing.
Instead of modifying the algorithm to use 0-based indexing, I created a
proxy object, OneBased
, that allows me to use 1-based indexing on
sequences:
class OneBased: """A proxy object that allows sequences to use 1-based indexing.""" def __init__(self, seq): self.seq = seq def __getitem__(self, key): assert key >= 1 return self.seq[key - 1] def __len__(self): return len(self.seq) x = OneBased("hello") x[1]
'h'
x[5]
'o'
The class OneBased
implements the special methods __getitem__ and __len__.
I chose to use 1-based indexing because it makes it easier to directly copy the algorithm from the paper, which reduces the chances of me making a mistake. Converting from 1-based indexing to 0-based indexing is straightforward, but tedious. E.g. here is some pseudocode from the paper:
j≔1; t≔0; next[1]≔0
while j<m do
begin
while t>0 and pattern[j]≠pattern[t]
do t≔next[t];
t≔t+1; j≔j+1
if pattern[j]=pattern[t]
then next[j]≔next[t]
else next[j]≔t;
end;
...
j≔k≔1;
while j≦m and k≦n do
begin
while j>0 and text[k]≠pattern[j]
do j≔next[j];
k≔k+1; j≔j+1
end;
If I use 1-based indexing here's the corresponding Python code:
j = 1; t = 0; next = {1:0}
while j < m:
while t > 0 and pattern[j] != pattern[t]:
t = next[t]
t = t + 1; j = j + 1
if pattern[j] == pattern[t]:
next[j] = next[t]
else:
next[j] = t
...
j = k = 1
while j <= m and k <= n:
while j > 0 and text[k] != pattern[j]:
j = next[j]
k = k + 1; j = j + 1
But, if I use 0-based indexing here's the corresponding Python code:
j = 0; t = -1; next = {0: -1}
while j < m - 1:
while t > -1 and pattern[j] != pattern[t]:
t = next[t]
t = t + 1; j = j + 1
if pattern[j] == pattern[t]:
next[j] = next[t]
else:
next[j] = t
...
j = k = 0
while j <= m - 1 and k <= n - 1:
while j > -1 and text[k] != pattern[j]:
j = next[j]
k = k + 1; j = j + 1
The high-level structure is the same, but I made plenty of off-by-one errors while I was creating the 0-based indexing example.
Here's my Python implementation of the Knuth-Morris-Pratt algorithm:
def knuth_morris_pratt(*, text, pattern):
"""Run the Knuth-Morris-Pratt algorithm on
sequences `text` and `pattern`. The
sequences must use 1-based indexing, not
0-based indexing. Returns the 1-based index
of the first match, or returns None if no
match is found.
text -- OneBased. E.g. OneBased("hello") or
OneBased([1, 2, 3, 4]).
pattern - OneBased. E.g. OneBased("ll") or
OneBased([2, 3]).
For more information, see:
> Knuth, Donald, Morris, James, Pratt, Vaughan. 1977.
> "Fast Pattern Matching in Strings."
> SIAM J. Comput., 6(2), 323–350. (28 pages)
> https://doi.org/10.1137/0206024
"""
n, m = len(text), len(pattern)
# Compute the "next" table.
j = 1; t = 0; next = {1:0}
while j < m:
while t > 0 and pattern[j] != pattern[t]:
t = next[t]
t = t + 1; j = j + 1
if pattern[j] == pattern[t]:
next[j] = next[t]
else:
next[j] = t
# Pattern-match process.
j = k = 1
while j <= m and k <= n:
while j > 0 and text[k] != pattern[j]:
j = next[j]
k = k + 1; j = j + 1
if j > m:
return k - m
To use this to solve the "String Search" problem, all I need to do is account for the 1-based indexing:
def string_search_fast(*, text, pattern): """Find the first index of `pattern` in `text` using the Knuth-Morris-Pratt algorithm. If nothing is found, then return None. text -- Sequence. E.g. "hello" or [1, 2, 3, 4]. pattern -- Sequence. E.g. "ll" or [2, 3]. """ index = knuth_morris_pratt( text=OneBased(text), pattern=OneBased(pattern) ) if index is not None: return index - 1 string_search_fast(text="hello world", pattern="o")
4
In conclusion...
In this week's post, you learned how to create a brute force solution for the
"String Search" problem. You also learned how to implement the
Knuth-Morris-Pratt algorithm to create a more efficient solution. One of
the key ideas behind the Knuth-Morris-Pratt algorithm is to precompute a
data structure (e.g. the next
table) to improve the overall time
complexity of the algorithm. You also learned how to create a proxy
object, OneBased
, that allowed you to use 1-based indexing with
Sequence objects.
String-searching algorithms have applications in a wide variety of fields, such as digital forensics and spam detection. In a digital forensics project, you may want to search a computer's memory for the URLs, so you search for the substrings "http:" and "https:". In a spam detection project, you may want to search emails for key phrases that indicate spam, such as "Be your own boss", "Extra income", "No purchase necessary", or "Meet singles".
My challenge to you:
Modify the Python functions
knuth_morris_pratt
andstring_search_fast
to return a list of all matching indices. E.g.string_search_fast_all(text="hello world", pattern="o")
[4, 7]
It's your decision to count overlapping matches or not.
If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!
(If you spot any errors or typos on this post, contact me via my contact page.)