Regular expression functions in Python
By John Lekberg on May 15, 2020.
This week's post is about different ways to use regular expressions (regexes) in Python. You will learn:
- What regexes are.
- How to split strings with regexes.
- How to find patterns with regexes.
- How to find-and-replace patterns with regexes.
What are regexes?
Regular expressions (regexes) are textual search patterns. E.g.
- The regex cat matches the text "cat".
- The regex c[auo]t matches the text "cat", "cut", and "cot".
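To check the second pattern yourself, here is a minimal sketch. It uses re.fullmatch, which is covered later in this post, and the word list is my own made-up test data.

```python
import re

# The character class [auo] matches exactly one of "a", "u", or "o".
pattern = "c[auo]t"

# Made-up test words: three that should match and two that should not.
words = ["cat", "cut", "cot", "cit", "caat"]

# Keep only the words that the pattern matches from start to end.
matched = [w for w in words if re.fullmatch(pattern, w)]
print(matched)  # ['cat', 'cut', 'cot']
```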
Python supports regexes with the re module. Read the module documentation for more information on the types of regexes that Python supports.
Splitting strings with regexes
Python strings have the str.split method to split a string:
"a,b,c".split(",")
['a', 'b', 'c']
But str.split is limited to using fixed string patterns.
This means that it would be hard to turn
"a,b,c;d,e,f"
into
['a', 'b', 'c', 'd', 'e', 'f']
Here's how I would use str.split:
text = "a,b,c;d,e,f"
result = []
for x in text.split(";"):
    result.extend(x.split(","))
result
['a', 'b', 'c', 'd', 'e', 'f']
However, it's much easier for me to use re.split with the regex
[;,]
that matches both ";" and ",":
import re
re.split("[;,]", "a,b,c;d,e,f")
['a', 'b', 'c', 'd', 'e', 'f']
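Real input is rarely this tidy. As a sketch of how the same idea extends, assume the separators may have stray spaces around them (the sample text below is invented). The pattern can absorb that whitespace, and the maxsplit argument of re.split caps the number of splits:

```python
import re

# Invented sample input with uneven spacing around the separators.
text = "a, b ,c ; d,e;  f"

# \s* absorbs optional whitespace on either side of each separator.
parts = re.split(r"\s*[;,]\s*", text)
print(parts)  # ['a', 'b', 'c', 'd', 'e', 'f']

# maxsplit=2 stops after two splits; the rest stays as one string.
first_two = re.split(r"\s*[;,]\s*", text, maxsplit=2)
print(first_two)  # ['a', 'b', 'c ; d,e;  f']
```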
Finding patterns with regexes
I have a secret message that I intercepted:
secret_message = """
Message Date: DATA[2004-09-03].
This is a top secret message from the U.S. Government
about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
22ea4c546071]).
The rendezvous coordinates are DATA[
35.89421911
139.94637467
].
"""
I want to see if the message mentions "metal gear", so I use re.search:
re.search(r"(?i)metal\s+gear", secret_message)
<re.Match object; span=(94, 104), match='Metal Gear'>
I also want to find all the parts of the message that look like DATA[...] (e.g. DATA[2004-09-03]), so I use re.findall:
re.findall(r"(?sm)DATA\[(.*?)\]", secret_message)
['2004-09-03',
'610d19f8-9d33-4927-9d30-\n22ea4c546071',
'\n 35.89421911\n 139.94637467\n']
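Two details of that pattern are easy to miss: the parentheses form a capturing group, so re.findall returns only the captured text, and the (?s) flag lets "." match newlines. Here is a small sketch on shortened, made-up text that isolates each effect:

```python
import re

# Shortened, made-up text for illustration.
text = "Date: DATA[2004-09-03]. Code: DATA[abc]."

# No capturing group: findall returns each whole match.
whole = re.findall(r"DATA\[.*?\]", text)
print(whole)  # ['DATA[2004-09-03]', 'DATA[abc]']

# One capturing group: findall returns only what the group captured.
inner = re.findall(r"DATA\[(.*?)\]", text)
print(inner)  # ['2004-09-03', 'abc']

# Without (?s), "." stops at a newline, so a multi-line block is missed.
multiline = "DATA[\n1\n2\n]"
without_s = re.findall(r"DATA\[(.*?)\]", multiline)
with_s = re.findall(r"(?s)DATA\[(.*?)\]", multiline)
print(without_s)  # []
print(with_s)     # ['\n1\n2\n']
```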
re.search attempts to find a pattern in the text and return a match.
re.search("(?i)gear", "METAL GEAR RAY")
<re.Match object; span=(6, 10), match='GEAR'>
There are two other functions related to re.search: re.match and re.fullmatch.
re.match only looks for the pattern at the beginning of the text:
re.search("y", "xyz")
<re.Match object; span=(1, 2), match='y'>
re.match("y", "xyz")
None
re.fullmatch only matches a pattern that covers the entire text:
re.search("xy", "xyz")
<re.Match object; span=(0, 2), match='xy'>
re.fullmatch("xy", "xyz")
None
re.fullmatch("xy.*", "xyz")
<re.Match object; span=(0, 3), match='xyz'>
re.match and re.fullmatch are useful shortcuts when you want to constrain your matching and don't want to waste effort searching other parts of the text.
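One place I find re.fullmatch handy is input validation. This is a minimal sketch, not a complete date validator: looks_like_date is a hypothetical helper name of my own, and it only checks the shape of the text, not whether the date exists on a calendar.

```python
import re

def looks_like_date(text):
    """Return True if text has the shape YYYY-MM-DD.

    Hypothetical helper: checks the shape only, so "2004-99-99"
    also passes.
    """
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is not None

print(looks_like_date("2004-09-03"))        # True
print(looks_like_date("2004-09-03 extra"))  # False

# re.match accepts the second string, because it only anchors the start.
prefix_ok = re.match(r"\d{4}-\d{2}-\d{2}", "2004-09-03 extra") is not None
print(prefix_ok)  # True
```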
re.findall attempts to find a pattern in text and return a list of all matches.
re.findall(r"\d", "a b c 1 2 3 d e f 4 5 6")
['1', '2', '3', '4', '5', '6']
There is a related function: re.finditer.
re.finditer iterates over matches, instead of returning a list:
re.finditer(r"\d", "a b c 1 2 3 d e f 4 5 6")
<callable_iterator at 0x109e8d950>
for x in re.finditer(r"\d", "a b c 1 2 3 d e f 4 5 6"):
    print(x)
<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(8, 9), match='2'>
<re.Match object; span=(10, 11), match='3'>
<re.Match object; span=(18, 19), match='4'>
<re.Match object; span=(20, 21), match='5'>
<re.Match object; span=(22, 23), match='6'>
re.finditer is useful when you only need to process the matches one at a time.
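Because re.finditer yields match objects rather than plain strings, each match carries its position as well as its text. As a small sketch using the same input as above:

```python
import re

text = "a b c 1 2 3 d e f 4 5 6"

# Pair each matched digit with the index where it starts.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d", text)]
print(positions)
# [('1', 6), ('2', 8), ('3', 10), ('4', 18), ('5', 20), ('6', 22)]

# Take only the first match without building a list of all of them.
first = next(re.finditer(r"\d", text))
print(first.group(), first.span())  # 1 (6, 7)
```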
Finding and replacing patterns with regexes
I want to redact the secret message from earlier:
secret_message = """
Message Date: DATA[2004-09-03].
This is a top secret message from the U.S. Government
about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
22ea4c546071]).
The rendezvous coordinates are DATA[
35.89421911
139.94637467
].
"""
I want to redact the DATA[...] blocks, so I use re.sub:
redacted_message = re.sub(
    r"(?ms)DATA\[.*?\]",
    "[REDACTED]",
    secret_message,
)
print(redacted_message)
Message Date: [REDACTED].
This is a top secret message from the U.S. Government
about Metal Gear RAY ([REDACTED]).
The rendezvous coordinates are [REDACTED].
re.sub replaces regex matches in text. The replacement can be a string:
re.sub(
    r"(?ms)DATA\[.*?\]",
    "[REDACTED]",
    "Message Date: DATA[2004-09-03]."
)
'Message Date: [REDACTED].'
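A string replacement does not have to be a fixed string. Here is a short sketch of two other options re.sub supports: a backreference such as \1 that reuses captured text, and the count argument that limits how many matches get replaced. The second input string is invented for the example.

```python
import re

text = "Message Date: DATA[2004-09-03]."

# \1 in the replacement stands for whatever the first group captured,
# so this strips the DATA[...] wrapper but keeps its contents.
unwrapped = re.sub(r"DATA\[(.*?)\]", r"\1", text)
print(unwrapped)  # Message Date: 2004-09-03.

# count=1 replaces only the first match and leaves the rest alone.
once = re.sub(r"DATA\[.*?\]", "[REDACTED]", "DATA[a] and DATA[b]", count=1)
print(once)  # [REDACTED] and DATA[b]
```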
The replacement can also be a function, for more advanced replacements. I may want to change the date mentioned in the secret message, but not redact it. This would give false information to anyone that intercepted the message, without raising too much suspicion. Here's a function that "masks" a date:
import datetime
import random

def mask_date(ymd):
    """Replace a date with another date, to throw
    off anyone intercepting the message.

    ymd -- a YYYY-MM-DD string.
    """
    date = datetime.datetime.strptime(ymd, "%Y-%m-%d")
    shift = datetime.timedelta(days=random.randint(3, 24))
    date -= shift
    return date.strftime("%Y-%m-%d")

mask_date("2020-04-05")
'2020-03-18'
mask_date("2020-04-05")
'2020-03-30'
I use re.sub to mask the date in the secret message:
secret_message = """
Message Date: DATA[2004-09-03].
This is a top secret message from the U.S. Government
about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
22ea4c546071]).
The rendezvous coordinates are DATA[
35.89421911
139.94637467
].
"""
masked_message = re.sub(
    "....-..-..", lambda m: mask_date(m[0]), secret_message
)
print(masked_message)
Message Date: DATA[2004-08-31].
This is a top secret message from the U.S. Government
about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
22ea4c546071]).
The rendezvous coordinates are DATA[
35.89421911
139.94637467
].
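The pattern "....-..-.." works for this message, but "." matches any character except a newline, so it would also accept text that is not a date. As a sketch of a stricter alternative, \d restricts each position to a digit. The sample text and the to_year_only function are made up for this example, and I use a deterministic replacement instead of mask_date so the output is predictable.

```python
import re

# Made-up text: one real date and one date-shaped non-date.
text = "Date: DATA[2004-09-03]. Id: DATA[abcd-ef-gh]."

def to_year_only(match):
    # Hypothetical replacement function: keep the year (group 1)
    # and hide the month and day.
    return match[1] + "-XX-XX"

# The loose pattern also replaces "abcd-ef-gh".
loose = re.sub(r"....-..-..", "[DATE]", text)
print(loose)  # Date: DATA[[DATE]]. Id: DATA[[DATE]].

# The strict pattern only touches text made of digits.
strict = re.sub(r"(\d{4})-\d{2}-\d{2}", to_year_only, text)
print(strict)  # Date: DATA[2004-XX-XX]. Id: DATA[abcd-ef-gh].
```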
In conclusion...
In this week's post you learned how to use regexes to split strings, find patterns, and find-and-replace. Learning how to use regexes is a transferable skill, because many different tools support regexes.
My challenge to you:
Consider this HTML text:
<h1 >Building a command line tool to compute Elo ratings</h1>
<p >By John Lekberg on May 01, 2020.</p>
<hr/>
<p>This week's post will cover building a command line tool that computes <a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo ratings</a>. You will learn: </p>
<ul>
<li>How to calculate <a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo ratings</a> in Python.</li>
<li>How to use casefolding to make case-insensitive comparisons.</li>
<li>How to have <a href="https://docs.python.org/3/library/argparse.html"> argparse</a> validate command line options. (E.g. checking that a number is positive.)</li>
</ul>
Use the regex functions you learned about to strip out the HTML tags and turn the text into something like this:
Building a command line tool to compute Elo ratings By John Lekberg on May 01, 2020. This week's post will cover building a command line tool that computes Elo ratings. You will learn: How to calculate Elo ratings in Python. How to use casefolding to make case-insensitive comparisons. How to have argparse validate command line options. (E.g. checking that a number is positive.)
If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!
(If you spot any errors or typos on this post, contact me via my contact page.)