Using regular expression flags in Python
By John Lekberg on March 11, 2020.
This week's post is about regular expression (regex) flags. You will learn how to use regex flags to:
- Add comments to your regular expressions.
- Do case-insensitive matching.
- Allow patterns to match specific lines instead of the whole text.
- Match patterns spanning over multiple lines.
You need a basic understanding of regexes to read this post. If you want to learn the basics of regexes, read this tutorial:
Python's built-in regex module is re.
What are regex flags?
Regex flags turn on optional regex features. E.g. case-insensitive matching, so that "dave" is treated the same as "Dave".
4 useful regex flags in Python are:
- VERBOSE. Allow inline comments and extra whitespace.
- IGNORECASE. Do case-insensitive matches.
- MULTILINE. Allow anchors (^ and $) to match the beginnings and ends of lines instead of matching the beginning and end of the whole text.
- DOTALL. Allow dot (.) to match any character, including a newline. (The default behavior of dot is to match anything, except for a newline.)
How can I use regex flags?
Each regex flag can be activated in three different ways:
- Activated with the long argument name (e.g. re.IGNORECASE).
- Activated with the short argument name (e.g. re.I).
- Activated with the inline name (e.g. "(?i)").
long | short | inline |
---|---|---|
re.VERBOSE | re.X | "(?x)" |
re.IGNORECASE | re.I | "(?i)" |
re.MULTILINE | re.M | "(?m)" |
re.DOTALL | re.S | "(?s)" |
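The long and short names in the table are two names for the same flag, and an inline flag ends up recorded on the compiled pattern just like a flag passed as an argument. A small sketch to check this (the variable names are only for illustration):

```python
import re

# Each pair is (long name, short name) from the table above.
aliases = [
    (re.VERBOSE, re.X),
    (re.IGNORECASE, re.I),
    (re.MULTILINE, re.M),
    (re.DOTALL, re.S),
]
all_same = all(long_name == short_name for long_name, short_name in aliases)

# An inline flag is recorded in the compiled pattern's flags.
inline = re.compile("(?i)dave")
has_ignorecase = bool(inline.flags & re.IGNORECASE)

print(all_same, has_ignorecase)  # True True
```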
To use short and long argument names, you pass them as arguments to re.compile, re.search, re.match, re.fullmatch, re.split, re.findall, re.finditer, re.sub, and re.subn. E.g.
import re
re.match("dave", "Dave")
None
re.match("dave", "Dave", flags=re.IGNORECASE)
<re.Match object; span=(0, 4), match='Dave'>
re.match("dave", "Dave", flags=re.I)
<re.Match object; span=(0, 4), match='Dave'>
re.findall("dave", "my friend Dave is named dave.", flags=re.I)
['Dave', 'dave']
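One detail worth sketching: in re.sub, the fourth positional argument is count, so it is safer to always pass flags by keyword. Compiling the regex once with re.compile is another option, since the compiled pattern remembers its flags. The sample text and names below are only for illustration:

```python
import re

text = "my friend Dave is named dave."

# Pass flags by keyword: the 4th positional argument of re.sub is count.
renamed = re.sub("dave", "Bob", text, flags=re.IGNORECASE)

# A compiled pattern remembers its flags, so they are given only once.
dave = re.compile("dave", flags=re.IGNORECASE)
mentions = dave.findall(text)
pieces = dave.split(text)

print(renamed)   # my friend Bob is named Bob.
print(mentions)  # ['Dave', 'dave']
print(pieces)    # ['my friend ', ' is named ', '.']
```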
With short and long argument names, flags can be combined using the | operator. E.g.
text = """
Dave is my friend.
dave is named dave.
Dave is dave?
"""
re.findall("^dave", text, flags=re.I)
[]
re.findall("^dave", text, flags=re.M)
['dave']
re.findall("^dave", text, flags=re.I | re.M)
['Dave', 'dave', 'Dave']
To use inline flag names, include them in the regex:
re.match("(?i)dave", "Dave")
<re.Match object; span=(0, 4), match='Dave'>
There are two ways to use inline flag names:
- Globally, which turns the flag on for the entire regex.
- Locally, which turns the flag on or off for part of a regex.
To use an inline flag name globally, write it like "(?i)" and include it at the beginning of the regex:
re.match("(?i)dave", "Dave")
<re.Match object; span=(0, 4), match='Dave'>
re.match("dave(?i)", "Dave")
DeprecationWarning: Flags not at the start of the expression 'dave(?i)'
<re.Match object; span=(0, 4), match='Dave'>
(This is a warning only in older versions of Python. Since Python 3.11, a global inline flag that is not at the start of the regex raises re.error instead.)
To use an inline flag name locally, write it like "(?i:...)" instead of "(?i)...". E.g.
re.match("hello (?i:dave)", "HELLO Dave")
None
re.match("hello (?i:dave)", "hello Dave")
<re.Match object; span=(0, 10), match='hello Dave'>
To turn a local flag off, write it like "(?-i:...)" instead of "(?i:...)". E.g.
re.match("(?i)hello (?-i:there) dave", "HELLO THERE DAVE")
None
re.match("(?i)hello (?-i:there) dave", "HELLO there DAVE")
<re.Match object; span=(0, 16), match='HELLO there DAVE'>
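As a rough sketch of where a locally disabled flag can help, suppose a keyword may appear in any case but the code that follows it must be uppercase. The "order" format below is made up for illustration:

```python
import re

# Hypothetical format: the word "order" in any case,
# followed by a code that must be two uppercase letters and some digits.
order_id = re.compile(r"(?i)order (?-i:[A-Z]{2})\d+")

upper_keyword = order_id.fullmatch("ORDER AB12") is not None
lower_keyword = order_id.fullmatch("order AB12") is not None
lower_code = order_id.fullmatch("order ab12") is not None

print(upper_keyword, lower_keyword, lower_code)  # True True False
```

The keyword matches in either case because of the global "(?i)", while "(?-i:...)" keeps the letter check case-sensitive.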
You can write multiple inline flags like "(?i)(?m)...", and you can also combine them like "(?im)...". E.g.
text = """
Dave is my friend.
dave is named dave.
Dave is dave?
"""
re.findall("(?i)(?m)^dave", text)
['Dave', 'dave', 'Dave']
re.findall("(?im)^dave", text)
['Dave', 'dave', 'Dave']
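Combined inline flags and flags combined with | should end up equivalent. A quick way to convince yourself is to compile both forms and compare them (the variable names are only for illustration):

```python
import re

text = """
Dave is my friend.
dave is named dave.
Dave is dave?
"""

by_argument = re.compile("^dave", flags=re.I | re.M)
by_inline = re.compile("(?im)^dave")

# Both compiled patterns carry the same flags and find the same matches.
same_flags = by_argument.flags == by_inline.flags
same_matches = by_argument.findall(text) == by_inline.findall(text)

print(same_flags, same_matches)  # True True
```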
The regex flag VERBOSE
re.VERBOSE, re.X, "(?x)"
The VERBOSE flag allows inline comments and extra whitespace. E.g.
pattern = """(?x)
    from [ ]+
    [0-9:]+     # start time
    [ ]+ to [ ]+
    [0-9:]+     # end time
"""
re.search(pattern, "Event: Lunch from 10 to 11")
<re.Match object; span=(13, 26), match='from 10 to 11'>
If you want to match whitespace, you must explicitly denote it using "[ ]" or "\\t". (See "[ ]+" in the above pattern.)
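To see the whitespace rule in action, here is a minimal sketch with three patterns compiled using re.VERBOSE. A bare space is ignored, while a space inside a character class or after a backslash is kept. The pattern names are only for illustration:

```python
import re

# The space between the words is ignored, so this matches "helloworld".
ignored = re.compile(r"hello world", flags=re.VERBOSE)

# A space inside a character class is kept.
in_class = re.compile(r"hello[ ]world", flags=re.VERBOSE)

# A space preceded by a backslash is kept too.
escaped = re.compile(r"hello\ world", flags=re.VERBOSE)

print(ignored.fullmatch("helloworld") is not None)    # True
print(ignored.fullmatch("hello world") is not None)   # False
print(in_class.fullmatch("hello world") is not None)  # True
print(escaped.fullmatch("hello world") is not None)   # True
```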
The benefit of using the VERBOSE flag is that you can create regexes that are more readable and easier to maintain for you and your coworkers. E.g.
- Compare this regex
  "M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
  to an equivalent regex that uses the VERBOSE flag:
  """(?x)
  M{0,4}
  ( CM | CD | D?C{0,3} )
  ( XC | XL | L?X{0,3} )
  ( IX | IV | V?I{0,3} )
  """
The regex flag IGNORECASE
re.IGNORECASE, re.I, "(?i)"
The IGNORECASE flag makes all matching case-insensitive. E.g.
sql_code = """
SELECT Students.name
FROM Packages P1
Inner Join Friends on Friends.id = P1.id
INNER JOIN Packages P2 On P2.id = Friends.friend_id
join join Students ON Students.id = P1.id
WHERE P2.salary > P1.salary
order BY P2.salary ;
"""
sql_keywords = "(?i)select|from|inner join|on|where|order by"
re.findall(sql_keywords, sql_code)
['SELECT',
'FROM',
'Inner Join',
'on',
'INNER JOIN',
'On',
'ON',
'WHERE',
'order BY']
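One caveat with a short keyword like "on" is that it can also match inside longer words. A possible refinement, sketched here with a made-up query, is to wrap the alternatives in word boundaries (\b):

```python
import re

# Hypothetical query: "Persons", "region" and "Ontario" all contain "on".
sql = "select name from Persons where region = 'Ontario' order by name"

naive = re.findall("(?i)select|from|inner join|on|where|order by", sql)

# \b only matches at the edge of a word, so "on" inside "Persons" is skipped.
bounded = re.findall(
    r"(?i)\b(?:select|from|inner join|on|where|order by)\b", sql
)

print(naive)    # ['select', 'from', 'on', 'where', 'on', 'On', 'order by']
print(bounded)  # ['select', 'from', 'where', 'order by']
```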
The IGNORECASE flag is useful when the pattern that you are searching for may or may not be capitalized. E.g.
- When you search text for mentions of your friend Dave, you want to match "dave" and "Dave". You use the regex "(?i)dave".
- You are searching SQL code for mentions of a table named "Employee_Information". Because SQL identifiers are usually case-insensitive, this could be written as "EMPLOYEE_INFORMATION", "employee_information", and any other variation. You use the regex "(?i)employee_information".
The regex flag MULTILINE
re.MULTILINE, re.M, "(?m)"
The MULTILINE flag allows anchors (^ and $) to match the beginnings and ends of lines instead of matching the beginning and end of the whole text.
python_code = """\
def f(x):
    return x + 4

class Dog:
    def bark(self):
        print("bark")
"""
python_function = r"^[ ]*def \w+"
re.findall(python_function, python_code, flags=re.MULTILINE)
['def f', '    def bark']
Without using the MULTILINE flag, only "def f" would match:
re.findall(python_function, python_code)
['def f']
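Two related points can be sketched in code. First, \A still anchors to the start of the whole text even when MULTILINE is on. Second, ^ and $ together can return whole lines based on how they end. The sample snippets and names below are only for illustration, and the second pattern is one possible way to find lines that do not end with a semicolon:

```python
import re

python_code = """\
def f(x):
    return x + 4

class Dog:
    def bark(self):
        print("bark")
"""

# ^ matches at the start of every line when MULTILINE is on ...
every_line = re.findall(r"^[ ]*def \w+", python_code, flags=re.MULTILINE)

# ... while \A still matches only at the start of the whole text.
text_start = re.findall(r"\A[ ]*def \w+", python_code, flags=re.MULTILINE)

# Whole lines whose last character is neither a semicolon nor whitespace.
# Excluding whitespace keeps a newline from counting as the "last character".
c_like_code = "int x = 1;\nint y = 2\n\nreturn x + y;\nfoo()\n"
missing_semicolon = re.findall(r"^.*[^;\s]$", c_like_code, flags=re.MULTILINE)

print(every_line)         # ['def f', '    def bark']
print(text_start)         # ['def f']
print(missing_semicolon)  # ['int y = 2', 'foo()']
```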
The MULTILINE flag is useful when the pattern that you are searching for looks at the beginning of a line (or at the end of a line). E.g.
- You want to find all lines in a code file that begin with "def", so you use the regex "(?m)^def".
- You want to find all lines in a code file that don't end with a semicolon, so you use the regex "(?m)[^;]$".
The regex flag DOTALL
re.DOTALL, re.S, "(?s)"
The DOTALL flag allows dot (.) to match any character, including a newline. (The default behavior of dot is to match anything, except for a newline.) E.g.
secret_message = """
Message Date: DATA[2004-09-03].

This is a top secret message from the U.S. Government
about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
22ea4c546071]).

The rendezvous coordinates are DATA[
 35.89421911
 139.94637467
].
"""
data_blob = r"DATA\[.*?\]"
re.findall(data_blob, secret_message, flags=re.DOTALL)
['DATA[2004-09-03]',
'DATA[610d19f8-9d33-4927-9d30-\n22ea4c546071]',
'DATA[\n 35.89421911\n 139.94637467\n]']
Without using the DOTALL flag, only data blobs that fit on one line would match:
re.findall(data_blob, secret_message)
['DATA[2004-09-03]']
The DOTALL flag is useful when the pattern that you are searching for may span across multiple lines. E.g.
- You are searching through XML data and want to find CDATA sections. CDATA sections start with <![CDATA[, end with ]]>, and can span multiple lines. You use the regex "(?s)<!\[CDATA\[.*?\]\]>".
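Here is a minimal sketch of the CDATA case. The XML snippet is made up, and a capture group is added (an assumption, not part of the regex above) so that only the contents are returned. It also shows why the non-greedy ".*?" matters once dot can cross newlines:

```python
import re

# Hypothetical XML with two CDATA sections, one spanning two lines.
xml = """<note>
<body><![CDATA[line one
line two]]></body>
<footer><![CDATA[x < y]]></footer>
</note>"""

# Non-greedy: each CDATA section is matched separately.
contents = re.findall(r"(?s)<!\[CDATA\[(.*?)\]\]>", xml)

# Greedy: with DOTALL, ".*" runs from the first opening to the last closing.
greedy = re.findall(r"(?s)<!\[CDATA\[(.*)\]\]>", xml)

print(contents)     # ['line one\nline two', 'x < y']
print(len(greedy))  # 1
```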
In conclusion...
In this article, you learned how to use regex flags to improve your regexes. Regex flags are features that can be turned on to allow for things like case-insensitive matching and the ability to add comments to your regex. They allow you to write regexes like this
"(?i)hello world"
Instead of regexes like this
"[Hh][Ee][Ll][Ll][Oo] [Ww][Oo][Rr][Ll][Dd]"
My challenge to you:
Create your own regular expressions that use the regex flags that you learned about today: VERBOSE, IGNORECASE, MULTILINE, and DOTALL.
If you enjoyed this post, let me know. Share this with your friends and stay tuned for next week's post. See you then!