# Solving the String Search problem in Python

By John Lekberg on November 15, 2020.

This week's post is about solving the "String Search" problem. String-searching algorithms have applications in a wide range of fields, including digital forensics and spam detection.

You will learn:

- How to create a brute force solution.
- How to improve the time complexity of the brute force solution by using the Knuth-Morris-Pratt algorithm.
- How to create a proxy object that allows you to use 1-based indexing (used by the Knuth-Morris-Pratt algorithm).

# Problem statement

The goal of the "String Search" problem is to find the first index (using
0-based indexing) of a substring, **pattern**, in a larger string,
**text.** E.g.

Find the first index of "o" in "hello world".

As input, you are given two strings, **text** and **pattern.**

As output, you must return the first index of **pattern** inside **text**. If
**pattern** is not a substring of **text**, then return nothing.

E.g., here's a scenario with a match:

**Text**is "hello world".**Pattern**is "o".- The first index of
**pattern**inside**text**is 4.

E.g., here's a scenario without a match:

**Text**is "hello world".**Pattern**is "z".**Pattern**does not occur inside**text.**

**In production code,** please use the built-in method str.index instead of
creating your own string-searching algorithm. However, there are only
built-in methods for str objects and bytes objects: if you want to, e.g., find the first index of `[2, 3]`

within `[1, 2, 3, 4, 5]`

, then you'll need to use other code (write your own or use a library).

# How I represent the data in the problem

For the input, I represent **text** and **pattern** as Sequence
objects. In Python, a Sequence object can be a str
object, a list object, a bytes object, a tuple object, etc.
It represents a general "sequence" type. E.g. here are some Sequence
objects:

`"hello world"`

`(1, 2, 3, 4, 5)`

`["A♭maj⁷", "gm", "f♯o", "fm⁷", "E⁷", "E♭", "D⁷", "G⁷sus4", "G⁷"]`

For the output, I represent the index as an int object (e.g. `7`

). If there is no
match, then I use the special constant None.

# Creating a brute force solution

A brute force solution to the "String Search" problem can iterate over
every index in **text** and see if **pattern** matches at that index. If the
match fails at a particular index, then I move onto the next index.

Here's Python code that does this:

`def string_search_bf(*, text, pattern): """Find the first index of `pattern` in `text` using brute force. If nothing is found, then return None. text -- Sequence. E.g. "hello" or [1, 2, 3, 4]. pattern -- Sequence. E.g. "ll" or [2, 3]. """ n, m = len(text), len(pattern) for i in range(1 + (n - m)): match = True for j in range(m): if text[i + j] != pattern[j]: match = False break if match: return i string_search_bf(text="hello world", pattern="o")`

```
4
```

What's the time complexity of this code? For `n`-element
**text** and `m`-element **pattern:**

- I iterate over
`i`

O(`n`-`m`) times.- Within each iteration, I iterate over
`j`

O(`m`) times.

- Within each iteration, I iterate over

As a result, the overall time complexity of this algorithm is

O((

n-m)m) time

# Creating a faster solution using the Knuth-Morris-Pratt algorithm

For `n`-element **text** and `m`-element **pattern**, the
Knuth-Morris-Pratt algorithm solves the "String Search" problem in

O(

n+m) time

This is more efficient than the time complexity of the brute force algorithm,

O((

n-m)m) time

The key idea of the Knuth-Morris-Pratt algorithm is to **make use of
previous partial matches.** In the brute force algorithm, when a match fails, the algorithm backtracks **text** to the next
index. But in the Knuth-Morris-Pratt algorithm, when a match fails, the
algorithm doesn't backtrack **text**. Instead, the algorithm determines how to
backtrack **pattern** to find the current partial match.

For more information on the Knuth-Morris-Pratt algorithm, see

- Knuth, Donald, Morris, James, Pratt, Vaughan. 1977. "Fast Pattern Matching in Strings." SIAM J. Comput., 6(2), 323–350. (28 pages) https://doi.org/10.1137/0206024

The algorithm outlined in this paper uses 1-based indexing.
Instead of modifying the algorithm to use 0-based indexing, I created a
proxy object, `OneBased`

, that allows me to use 1-based indexing on
sequences:

`class OneBased: """A proxy object that allows sequences to use 1-based indexing.""" def __init__(self, seq): self.seq = seq def __getitem__(self, key): assert key >= 1 return self.seq[key - 1] def __len__(self): return len(self.seq) x = OneBased("hello") x[1]`

```
'h'
```

`x[5]`

```
'o'
```

The class `OneBased`

implements the special methods __getitem__ and __len__.

I chose to use 1-based indexing because it makes it easier to directly copy the algorithm from the paper, which reduces the chances of me making a mistake. Converting from 1-based indexing to 0-based indexing is straightforward, but tedious. E.g. here is some pseudocode from the paper:

```
j≔1; t≔0; next[1]≔0
while j<m do
begin
while t>0 and pattern[j]≠pattern[t]
do t≔next[t];
t≔t+1; j≔j+1
if pattern[j]=pattern[t]
then next[j]≔next[t]
else next[j]≔t;
end;
...
j≔k≔1;
while j≦m and k≦n do
begin
while j>0 and text[k]≠pattern[j]
do j≔next[j];
k≔k+1; j≔j+1
end;
```

If I use 1-based indexing here's the corresponding Python code:

```
j = 1; t = 0; next = {1:0}
while j < m:
while t > 0 and pattern[j] != pattern[t]:
t = next[t]
t = t + 1; j = j + 1
if pattern[j] == pattern[t]:
next[j] = next[t]
else:
next[j] = t
...
j = k = 1
while j <= m and k <= n:
while j > 0 and text[k] != pattern[j]:
j = next[j]
k = k + 1; j = j + 1
```

But, if I use 0-based indexing here's the corresponding Python code:

```
j = 0; t = -1; next = {0: -1}
while j < m - 1:
while t > -1 and pattern[j] != pattern[t]:
t = next[t]
t = t + 1; j = j + 1
if pattern[j] == pattern[t]:
next[j] = next[t]
else:
next[j] = t
...
j = k = 0
while j <= m - 1 and k <= n - 1:
while j > -1 and text[k] != pattern[j]:
j = next[j]
k = k + 1; j = j + 1
```

The high-level structure is the same, but I made plenty of off-by-one errors while I was creating the 0-based indexing example.

Here's my Python implementation of the Knuth-Morris-Pratt algorithm:

```
def knuth_morris_pratt(*, text, pattern):
"""Run the Knuth-Morris-Pratt algorithm on
sequences `text` and `pattern`. The
sequences must use 1-based indexing, not
0-based indexing. Returns the 1-based index
of the first match, or returns None if no
match is found.
text -- OneBased. E.g. OneBased("hello") or
OneBased([1, 2, 3, 4]).
pattern - OneBased. E.g. OneBased("ll") or
OneBased([2, 3]).
For more information, see:
> Knuth, Donald, Morris, James, Pratt, Vaughan. 1977.
> "Fast Pattern Matching in Strings."
> SIAM J. Comput., 6(2), 323–350. (28 pages)
> https://doi.org/10.1137/0206024
"""
n, m = len(text), len(pattern)
# Compute the "next" table.
j = 1; t = 0; next = {1:0}
while j < m:
while t > 0 and pattern[j] != pattern[t]:
t = next[t]
t = t + 1; j = j + 1
if pattern[j] == pattern[t]:
next[j] = next[t]
else:
next[j] = t
# Pattern-match process.
j = k = 1
while j <= m and k <= n:
while j > 0 and text[k] != pattern[j]:
j = next[j]
k = k + 1; j = j + 1
if j > m:
return k - m
```

To use this to solve the "String Search" problem, all I need to do is account for the 1-based indexing:

`def string_search_fast(*, text, pattern): """Find the first index of `pattern` in `text` using the Knuth-Morris-Pratt algorithm. If nothing is found, then return None. text -- Sequence. E.g. "hello" or [1, 2, 3, 4]. pattern -- Sequence. E.g. "ll" or [2, 3]. """ index = knuth_morris_pratt( text=OneBased(text), pattern=OneBased(pattern) ) if index is not None: return index - 1 string_search_fast(text="hello world", pattern="o")`

```
4
```

# In conclusion...

In this week's post, you learned how to create a brute force solution for the
"String Search" problem. You also learned how to implement the
Knuth-Morris-Pratt algorithm to create a more efficient solution. One of
the key ideas behind the Knuth-Morris-Pratt algorithm is to precompute a
data structure (e.g. the `next`

table) to improve the overall time
complexity of the algorithm. You also learned how to create a proxy
object, `OneBased`

, that allowed you to use 1-based indexing with
Sequence objects.

String-searching algorithms have applications in a wide variety of fields, such as digital forensics and spam detection. In a digital forensics project, you may want to search a computer's memory for the URLs, so you search for the substrings "http:" and "https:". In a spam detection project, you may want to search emails for key phrases that indicate spam, such as "Be your own boss", "Extra income", "No purchase necessary", or "Meet singles".

My challenge to you:

Modify the Python functions

`knuth_morris_pratt`

and`string_search_fast`

to return a list of all matching indices. E.g.`string_search_fast_all(text="hello world", pattern="o")`

`[4, 7]`

It's your decision to count overlapping matches or not.

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!

(If you spot any errors or typos on this post, contact me via my contact page.)