Regular expression functions in Python

By John Lekberg on May 15, 2020.


This week's post is about different ways to use regular expressions (regexes) in Python. You will learn how to use regexes to split strings, find patterns, and find and replace patterns.

What are regexes?

Regular expressions (regexes) are textual search patterns. E.g. the regex [;,] matches either ";" or ",".

Python supports regexes with the re module. Read the module documentation for more information on the types of regexes that Python supports.

Splitting strings with regexes

Python strings have the str.split method to split a string:

"a,b,c".split(",")
['a', 'b', 'c']

But str.split is limited to using fixed string patterns. This means that it would be hard to turn

"a,b,c;d,e,f"

into

['a', 'b', 'c', 'd', 'e', 'f']

Here's how I would use str.split:

text = "a,b,c;d,e,f"
result = []
for x in text.split(";"):
    result.extend(x.split(","))
result
['a', 'b', 'c', 'd', 'e', 'f']

However, it's much easier for me to use re.split with the regex [;,] that matches both ";" and ",":

import re

re.split("[;,]", "a,b,c;d,e,f")
['a', 'b', 'c', 'd', 'e', 'f']
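re.split also accepts richer patterns than a single character class. Here's a small sketch (the input string is made up for illustration) where the pattern also swallows whitespace around each delimiter, and the maxsplit argument limits the number of splits:

```python
import re

# Hypothetical input with inconsistent spacing around the delimiters.
text = "a, b ;c,d"

# \s* on both sides consumes optional whitespace around ";" or ",".
parts = re.split(r"\s*[;,]\s*", text)
print(parts)  # ['a', 'b', 'c', 'd']

# maxsplit=1 stops after the first split and leaves the rest intact.
head_tail = re.split(r"\s*[;,]\s*", text, maxsplit=1)
print(head_tail)  # ['a', 'b ;c,d']
```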

Finding patterns with regexes

I have a secret message that I intercepted:

secret_message = """
Message Date: DATA[2004-09-03].

This is a top secret message from the U.S. Government
about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
22ea4c546071]).

The rendezvous coordinates are DATA[
    35.89421911
    139.94637467
].
"""

I want to see if the message mentions "metal gear", so I use re.search:

re.search(r"(?i)metal\s+gear", secret_message)
<re.Match object; span=(94, 104), match='Metal Gear'>

I also want to find all the parts of the message that look like DATA[...] (e.g. DATA[2004-09-03]), so I use re.findall:

re.findall(r"(?sm)DATA\[(.*?)\]", secret_message)
['2004-09-03',
 '610d19f8-9d33-4927-9d30-\n22ea4c546071',
 '\n    35.89421911\n    139.94637467\n']
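The (?sm) prefix in that pattern sets inline flags. The s flag lets "." match newlines, which is what allows a DATA[...] block to span several lines. The m flag only changes how ^ and $ behave, so it has no effect on this particular pattern. Here's a sketch of the difference, using a shortened stand-in for the secret message (the sample text is my own):

```python
import re

# A shortened stand-in for the secret message above.
sample = "Date: DATA[2004-09-03]. Coordinates: DATA[\n    35.89\n    139.94\n]."

# Without the s flag, "." does not match a newline, so the
# multi-line block is skipped.
single_line_only = re.findall(r"DATA\[(.*?)\]", sample)
print(single_line_only)  # ['2004-09-03']

# (?s) inside the pattern and flags=re.DOTALL are two spellings
# of the same option.
inline_flag = re.findall(r"(?s)DATA\[(.*?)\]", sample)
flag_argument = re.findall(r"DATA\[(.*?)\]", sample, flags=re.DOTALL)
print(inline_flag)  # ['2004-09-03', '\n    35.89\n    139.94\n']
print(inline_flag == flag_argument)  # True
```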

re.search scans the text for the first place where the pattern matches and returns a Match object (or None if there is no match).

re.search("(?i)gear", "METAL GEAR RAY")
<re.Match object; span=(6, 10), match='GEAR'>

There are two other functions related to re.search: re.match and re.fullmatch. re.match only looks for the pattern at the beginning of the text:

re.search("y", "xyz")
<re.Match object; span=(1, 2), match='y'>
re.match("y", "xyz")
None

re.fullmatch only matches a pattern that covers the entire text:

re.search("xy", "xyz")
<re.Match object; span=(0, 2), match='xy'>
re.fullmatch("xy", "xyz")
None
re.fullmatch("xy.*", "xyz")
<re.Match object; span=(0, 3), match='xyz'>

re.match and re.fullmatch are useful shortcuts when you want to constrain your matching and don't want to waste effort searching other parts of the text.
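One place I find re.fullmatch handy is input validation, because it rejects any leftover text. Here's a minimal sketch; is_ymd is a hypothetical helper name, and it only checks the shape of the string, not whether the date is a real calendar date:

```python
import re

def is_ymd(text):
    """Return True if text looks like YYYY-MM-DD and nothing else.

    Hypothetical helper: checks the shape only, so "2004-99-99" passes.
    """
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is not None

print(is_ymd("2004-09-03"))        # True
print(is_ymd("2004-09-03 extra"))  # False

# re.match anchors only at the start, so trailing text is allowed.
starts_with_ymd = re.match(r"\d{4}-\d{2}-\d{2}", "2004-09-03 extra") is not None
print(starts_with_ymd)  # True
```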


re.findall attempts to find a pattern in text and return a list of all matches.

re.findall(r"\d", "a b c 1 2 3 d e f 4 5 6")
['1', '2', '3', '4', '5', '6']
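One detail worth knowing: when the pattern contains capture groups, re.findall returns the groups instead of the whole match. That is why the DATA[...] example earlier returned only the text inside the brackets. A short sketch with a made-up input string:

```python
import re

text = "x=1, y=22, z=333"

# No groups: each item is the whole match.
whole = re.findall(r"\w=\d+", text)
print(whole)  # ['x=1', 'y=22', 'z=333']

# One group: each item is that group's text.
one_group = re.findall(r"\w=(\d+)", text)
print(one_group)  # ['1', '22', '333']

# Two or more groups: each item is a tuple of the groups.
pairs = re.findall(r"(\w)=(\d+)", text)
print(pairs)  # [('x', '1'), ('y', '22'), ('z', '333')]
```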

There is a related function: re.finditer. re.finditer iterates over matches, instead of returning a list:

re.finditer(r"\d", "a b c 1 2 3 d e f 4 5 6")
<callable_iterator at 0x109e8d950>
for x in re.finditer(r"\d", "a b c 1 2 3 d e f 4 5 6"):
    print(x)
<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(8, 9), match='2'>
<re.Match object; span=(10, 11), match='3'>
<re.Match object; span=(18, 19), match='4'>
<re.Match object; span=(20, 21), match='5'>
<re.Match object; span=(22, 23), match='6'>

re.finditer is useful when you only need to process the matches one at a time.
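Here's a sketch of two things re.finditer makes easy: pulling just the first match with next(), and reading each match's position, which re.findall does not report. The text is the same string used above:

```python
import re

text = "a b c 1 2 3 d e f 4 5 6"

# next() asks the iterator for a single match; the default of None
# covers the case where nothing matches.
first = next(re.finditer(r"\d", text), None)
print(first.group(), first.span())  # 1 (6, 7)

# Each Match object carries its position in the text.
positions = [m.start() for m in re.finditer(r"\d", text)]
print(positions)  # [6, 8, 10, 18, 20, 22]
```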

Finding and replacing patterns with regexes

I want to redact the secret message from earlier:

secret_message = """
Message Date: DATA[2004-09-03].

This is a top secret message from the U.S. Government
about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
22ea4c546071]).

The rendezvous coordinates are DATA[
    35.89421911
    139.94637467
].
"""

I want to redact the DATA[...] blocks, so I use re.sub:

redacted_message = re.sub(
    r"(?ms)DATA\[.*?\]",
    "[REDACTED]",
    secret_message,
)
print(redacted_message)
Message Date: [REDACTED].

This is a top secret message from the U.S. Government
about Metal Gear RAY ([REDACTED]).

The rendezvous coordinates are [REDACTED].

re.sub replaces regex matches in text. The replacement can be a string:

re.sub(
    r"(?ms)DATA\[.*?\]",
    "[REDACTED]",
    "Message Date: DATA[2004-09-03]."
)
'Message Date: [REDACTED].'
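A string replacement can also refer back to what the pattern captured: \1 inserts the first group, and \g<name> inserts a named group. A small sketch, assuming I want to unwrap the DATA[...] marker or reorder the date rather than redact it:

```python
import re

text = "Message Date: DATA[2004-09-03]."

# \1 in the replacement inserts whatever the first group captured.
unwrapped = re.sub(r"DATA\[(.*?)\]", r"\1", text)
print(unwrapped)  # Message Date: 2004-09-03.

# Named groups can be referenced with \g<name>.
reordered = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<d>/\g<m>/\g<y>",
    text,
)
print(reordered)  # Message Date: DATA[03/09/2004].
```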

The replacement can also be a function, for more advanced replacements. The function is called with each Match object and returns the replacement string. I may want to change the date mentioned in the secret message, but not redact it. This would give false information to anyone who intercepts the message, without raising too much suspicion. Here's a function that "masks" a date:

import datetime
import random

def mask_date(ymd):
    """Replace a date with another date, to
    throw off anyone intercepting the message.
    
    ymd -- a YYYY-MM-DD string.
    """
    date = datetime.datetime.strptime(ymd, "%Y-%m-%d")
    shift = datetime.timedelta(days=random.randint(3, 24))
    date -= shift
    return date.strftime("%Y-%m-%d")

mask_date("2020-04-05")
'2020-03-18'
mask_date("2020-04-05")
'2020-03-30'

I use re.sub to mask the date in the secret message:

secret_message = """
Message Date: DATA[2004-09-03].

This is a top secret message from the U.S. Government
about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
22ea4c546071]).

The rendezvous coordinates are DATA[
    35.89421911
    139.94637467
].
"""
masked_message = re.sub(
    "....-..-..",
    lambda m: mask_date(m[0]),
    secret_message
)
print(masked_message)
Message Date: DATA[2004-08-31].

This is a top secret message from the U.S. Government
about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
22ea4c546071]).

The rendezvous coordinates are DATA[
    35.89421911
    139.94637467
].
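The pattern "....-..-.." works on this message, but "." matches any character, so on other text it could hand a non-date to the date parser. Here's a sketch of a stricter variant. shift_date is a hypothetical stand-in for mask_date that uses a fixed shift so the result is reproducible, and the sample text is made up:

```python
import datetime
import re

def shift_date(match, days=7):
    """Hypothetical variant of mask_date: move the matched
    YYYY-MM-DD date back by a fixed number of days."""
    date = datetime.date.fromisoformat(match[0])
    return (date - datetime.timedelta(days=days)).isoformat()

text = "Date: DATA[2004-09-03]. Code: abcd-ef-gh."

# \d restricts the match to digits. "....-..-.." would also match
# "abcd-ef-gh", and parsing that as a date raises ValueError.
masked = re.sub(r"\d{4}-\d{2}-\d{2}", shift_date, text)
print(masked)  # Date: DATA[2004-08-27]. Code: abcd-ef-gh.
```

The stricter pattern still accepts impossible dates such as "2004-99-99", so the parser can still raise ValueError on unusual input.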

In conclusion...

In this week's post you learned how to use regexes to split strings, find patterns, and find-and-replace. Learning how to use regexes is a transferable skill, because many different tools support regexes.

My challenge to you:

Consider this HTML text:

<h1
    >Building a command line tool to compute Elo ratings</h1>
 <p
    >By John Lekberg on May 01, 2020.</p>
 <hr/>
 <p>This week's post will cover building a command line
 tool that computes <a
 href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo
 ratings</a>.
    You will learn:
 </p>
 <ul>
    <li>How to calculate <a
    href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo
    ratings</a> in
    Python.</li>
    <li>How to use casefolding to make case-insensitive
    comparisons.</li>
    <li>How to have <a
    href="https://docs.python.org/3/library/argparse.html">
    argparse</a> validate command line options. (E.g.
    checking that a
    number is positive.)</li>
 </ul>

Use the regex functions you learned about to strip out the HTML tags and turn the text into something like this:

Building a command line tool to compute Elo ratings
By John Lekberg on May 01, 2020.

This week's post will cover building a command line tool
that computes Elo ratings.
You will learn:


How to calculate Elo ratings in
Python.
How to use casefolding to make case-insensitive comparisons.
How to have argparse
validate command line options. (E.g. checking that a number is
positive.)

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!

