Using regular expression flags in Python

By John Lekberg on March 11, 2020.

This week's post is about regular expression (regex) flags. You will learn how to use regex flags to:

Add comments to your regular expressions.
Do case-insensitive matching.
Allow patterns to match specific lines instead of the whole text.
Match patterns that span multiple lines.

You need a basic understanding of regexes to read this post. If you want to learn the basics of regexes, read this tutorial:

"Regular Expression HOWTO" by A.M. Kuchling

Python's built-in regex module is re.

What are regex flags?

Regex flags are options that turn on useful regex features. E.g.

Allow case-insensitive matching so that "dave" is treated the same as "Dave".

Four useful regex flags in Python are:

VERBOSE. Allow inline comments and extra whitespace.
IGNORECASE. Do case-insensitive matches.
MULTILINE. Allow anchors (^ and $) to match the beginnings and ends of lines instead of matching the beginning and end of the whole text.
DOTALL. Allow dot (.) to match any character, including a newline. (The default behavior of dot is to match anything, except for a newline.)

How can I use regex flags?

Each regex flag can be activated in three different ways:

Activated with the long argument name (e.g. re.IGNORECASE).
Activated with the short argument name (e.g. re.I).
Activated with the inline name (e.g. "(?i)").

long	short	inline
`re.VERBOSE`	`re.X`	`"(?x)"`
`re.IGNORECASE`	`re.I`	`"(?i)"`
`re.MULTILINE`	`re.M`	`"(?m)"`
`re.DOTALL`	`re.S`	`"(?s)"`
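
As a quick check (a minimal sketch, not part of the original examples), the short names are aliases for the long names, so it doesn't matter which one you pass:

```python
import re

# Each short name should compare equal to its long name.
pairs = [
    (re.VERBOSE, re.X),
    (re.IGNORECASE, re.I),
    (re.MULTILINE, re.M),
    (re.DOTALL, re.S),
]

all_same = all(long_name == short_name for long_name, short_name in pairs)
print(all_same)  # True
```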

To use short and long argument names, you pass them as arguments to re.compile, re.search, re.match, re.fullmatch, re.split, re.findall, re.finditer, re.sub, and re.subn. E.g.

import re

re.match("dave", "Dave")

None

re.match("dave", "Dave", flags=re.IGNORECASE)

<re.Match object; span=(0, 4), match='Dave'>

re.match("dave", "Dave", flags=re.I)

<re.Match object; span=(0, 4), match='Dave'>

re.findall("dave", "my friend Dave is named dave.", flags=re.I)

['Dave', 'dave']
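
The examples above pass flags to module-level functions. Here is a small sketch of the re.compile route, where the flag is given once at compile time and the compiled pattern is reused. (The variable names are mine, chosen for illustration.)

```python
import re

# Compile once with the flag, then reuse the compiled pattern.
dave = re.compile("dave", flags=re.IGNORECASE)

first = dave.match("Dave")
everyone = dave.findall("my friend Dave is named dave.")

print(first.group())  # Dave
print(everyone)       # ['Dave', 'dave']
```

This is handy when the same regex is used many times, because the flags only have to be written in one place.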

With the short and long argument names, you can combine flags using the bitwise OR operator (|). E.g.

text = """
Dave is my friend.
dave is named dave.
Dave is dave?
"""

re.findall("^dave", text, flags=re.I)

[]

re.findall("^dave", text, flags=re.M)

['dave']

re.findall("^dave", text, flags=re.I | re.M)

['Dave', 'dave', 'Dave']

To use inline flag names, include them in the regex:

re.match("(?i)dave", "Dave")

<re.Match object; span=(0, 4), match='Dave'>

There are two ways to use inline flag names:

Globally, which turns the flag on for the entire regex.
Locally, which turns the flag on or off for part of a regex.

To use an inline flag name globally, write it like "(?i)" and include it at the beginning of the regex:

re.match("(?i)dave", "Dave")

<re.Match object; span=(0, 4), match='Dave'>

re.match("dave(?i)", "Dave")

DeprecationWarning: Flags not at the start of the expression 'dave(?i)'
<re.Match object; span=(0, 4), match='Dave'>
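
How a misplaced global flag is reported depends on your Python version: the output above shows a DeprecationWarning, while Python 3.11 and later reject the pattern with re.error. The sketch below checks a pattern in a way that should work on either kind of version. (compiles_cleanly is a hypothetical helper written for this illustration, not something that re provides.)

```python
import re
import warnings


def compiles_cleanly(pattern):
    """Return True if `pattern` compiles without an error or a warning.

    Hypothetical helper for illustration, not part of the re module.
    """
    with warnings.catch_warnings():
        # Turn warnings into exceptions so both outcomes look the same.
        warnings.simplefilter("error")
        try:
            re.compile(pattern)
        except (re.error, DeprecationWarning):
            return False
    return True


print(compiles_cleanly("(?i)dave"))  # True
print(compiles_cleanly("dave(?i)"))  # False
```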

To use an inline flag name locally, write it like "(?i:...)" instead of "(?i)...". E.g.

re.match("hello (?i:dave)", "HELLO Dave")

None

re.match("hello (?i:dave)", "hello Dave")

<re.Match object; span=(0, 10), match='hello Dave'>

To turn a local flag off, write it like "(?-i:...)" instead of "(?i:...)". E.g.

re.match("(?i)hello (?-i:there) dave", "HELLO THERE DAVE")

None

re.match("(?i)hello (?-i:there) dave", "HELLO there DAVE")

<re.Match object; span=(0, 16), match='HELLO there DAVE'>

You can write multiple inline flags like "(?i)(?m)...", and you can also combine them like "(?im)...". E.g.

text = """
Dave is my friend.
dave is named dave.
Dave is dave?
"""
re.findall("(?i)(?m)^dave", text)

['Dave', 'dave', 'Dave']

re.findall("(?im)^dave", text)

['Dave', 'dave', 'Dave']

The regex flag VERBOSE

re.VERBOSE, re.X, "(?x)"

The VERBOSE flag allows inline comments and extra whitespace. E.g.

pattern = """(?x)
from [ ]+ [0-9:]+  # start time
[ ]+
to [ ]+ [0-9:]+    # end time
"""
re.search(pattern, "Event: Lunch from     10 to  11")

<re.Match object; span=(13, 31), match='from     10 to  11'>

Because the VERBOSE flag makes the regex ignore whitespace in the pattern, any whitespace that you want to match must be written explicitly, e.g. "[ ]" for a space or "\\t" for a tab. (See "[ ]+" in the above pattern.)
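
To see why this matters, here is a minimal sketch (the strings are made up for illustration) of what happens to a bare space in a VERBOSE pattern:

```python
import re

# Under VERBOSE, the bare space in the pattern is ignored,
# so this pattern is really "helloworld".
bare = re.search("(?x)hello world", "hello world")
squashed = re.search("(?x)hello world", "helloworld")

# A space inside a character class is kept.
explicit = re.search("(?x)hello[ ]world", "hello world")

print(bare)              # None
print(squashed.group())  # helloworld
print(explicit.group())  # hello world
```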

The benefit of using the VERBOSE flag is that you can create regexes that are more readable and easier to maintain for you and your coworkers. E.g.

Compare this regex

"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"

to an equivalent regex that uses the VERBOSE flag:

"""(?x)
M{0,4}
( CM | CD | D?C{0,3} )
( XC | XL | L?X{0,3} )
( IX | IV | V?I{0,3} )
"""

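As a sanity check, here is a sketch that runs both versions against a few sample strings and confirms that they agree. The sample strings and the comments labelling each group are my own additions, based on reading the pattern as a Roman numeral matcher.

```python
import re

compact = "M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"

verbose = """(?x)
M{0,4}                  # thousands
( CM | CD | D?C{0,3} )  # hundreds
( XC | XL | L?X{0,3} )  # tens
( IX | IV | V?I{0,3} )  # ones
"""

samples = ["MCMXCIV", "XLII", "MMXX", "IIII", "VX", "hello"]

compact_results = [bool(re.fullmatch(compact, s)) for s in samples]
verbose_results = [bool(re.fullmatch(verbose, s)) for s in samples]

print(compact_results)                     # [True, True, True, False, False, False]
print(compact_results == verbose_results)  # True
```
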
The regex flag IGNORECASE

re.IGNORECASE, re.I, "(?i)"

The IGNORECASE flag makes all matching case-insensitive. E.g.

sql_code = """
  SELECT Students.name
    FROM Packages P1
         Inner Join Friends
                    on Friends.id = P1.id
         INNER JOIN Packages P2
                    On P2.id = Friends.friend_id
         join join Students
                    ON Students.id = P1.id
   WHERE P2.salary > P1.salary
order BY P2.salary
;
"""

sql_keywords = "(?i)select|from|inner join|on|where|order by"

re.findall(sql_keywords, sql_code)

['SELECT',
 'FROM',
 'Inner Join',
 'on',
 'INNER JOIN',
 'On',
 'ON',
 'WHERE',
 'order BY']

The IGNORECASE flag is useful when the pattern that you are searching for may or may not be capitalized. E.g.

When you search text for mentions of your friend Dave, you want to match "dave" and "Dave". You use the regex
```
"(?i)dave"
```
You are searching SQL code for mentions of a table named "Employee_Information". Because SQL is case-insensitive, this could be written as "EMPLOYEE_INFORMATION", "employee_information", and any other variation. You use the regex
```
"(?i)employee_information"
```
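
Here is a sketch of the second use case. The SQL text is invented for illustration, and I've added "\b" word boundaries (my assumption about what you'd want) so that a longer name like "Employee_Information_Archive" is not counted as a mention:

```python
import re

sql = """
select * from EMPLOYEE_INFORMATION;
SELECT id FROM employee_information WHERE id = 1;
select name from Employee_Information_Archive;
"""

# (?i) makes the table name match in any capitalization.
mentions = re.findall(r"(?i)\bemployee_information\b", sql)

print(mentions)  # ['EMPLOYEE_INFORMATION', 'employee_information']
```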

The regex flag MULTILINE

re.MULTILINE, re.M, "(?m)"

The MULTILINE flag allows anchors (^ and $) to match the beginnings and ends of lines instead of matching the beginning and end of the whole text.

python_code = """\
def f(x):
    return x + 4
    
class Dog:
    def bark(self):
        print("bark")
"""

python_function = r"^[ ]*def \w+"

re.findall(python_function, python_code, flags=re.MULTILINE)

['def f', '    def bark']

Without using the MULTILINE flag, only "def f" would match:

re.findall(python_function, python_code)

['def f']

The MULTILINE flag is useful when the pattern that you are searching for looks at the beginning of a line (or at the end of a line). E.g.

You want to find all lines in a code file that begin with "def", so you use the regex
```
"(?m)^def"
```
You want to find all lines in a code file that don't end with a semicolon, so you use the regex
```
"(?m)[^;]$"
```
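
Here is a sketch of both use cases on a made-up snippet. For the semicolon case I use a slightly different regex than the one above, "(?m)^.*[^;\n]$", which is my own variation: it returns the whole line instead of just its last character, and it excludes "\n" so that the newline before an empty line is not counted as a match.

```python
import re

code = """\
def f(x):
    return x + 4;

total = f(1);
def g(y):
    return y
"""

# Lines that begin with "def".
defs = re.findall(r"(?m)^def \w+", code)

# Whole non-empty lines whose last character is not a semicolon.
no_semicolon = re.findall(r"(?m)^.*[^;\n]$", code)

print(defs)          # ['def f', 'def g']
print(no_semicolon)  # ['def f(x):', 'def g(y):', '    return y']
```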

The regex flag DOTALL

re.DOTALL, re.S, "(?s)"

The DOTALL flag allows dot (.) to match any character, including a newline. (The default behavior of dot is to match anything, except for a newline.) E.g.

secret_message = """
Message Date: DATA[2004-09-03].

This is a top secret message from the U.S. Government
about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
22ea4c546071]).

The rendezvous coordinates are DATA[
    35.89421911
    139.94637467
].
"""

data_blob = r"DATA\[.*?\]"
re.findall(data_blob, secret_message, flags=re.DOTALL)

['DATA[2004-09-03]',
 'DATA[610d19f8-9d33-4927-9d30-\n22ea4c546071]',
 'DATA[\n    35.89421911\n    139.94637467\n]']

Without using the DOTALL flag, only data blobs that fit on one line would match:

re.findall(data_blob, secret_message)

['DATA[2004-09-03]']

The DOTALL flag is useful when the pattern that you are searching for may span multiple lines. E.g.

You are searching through XML data and want to find CDATA sections. CDATA sections start with <![CDATA[, end with ]]>, and can span multiple lines. You use the regex
```
r"(?s)<!\[CDATA\[.*?\]\]>"
```
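
Here is a sketch of the CDATA use case, assuming a small made-up XML document. It runs the same regex with and without the "(?s)" flag to show the difference:

```python
import re

xml = """<doc>
<note><![CDATA[one line]]></note>
<script><![CDATA[
if (a < b) {
    run();
}
]]></script>
</doc>"""

# With DOTALL, dot can cross newlines, so both sections are found.
with_dotall = re.findall(r"(?s)<!\[CDATA\[.*?\]\]>", xml)

# Without DOTALL, only the section that fits on one line is found.
without_dotall = re.findall(r"<!\[CDATA\[.*?\]\]>", xml)

print(len(with_dotall))  # 2
print(without_dotall)    # ['<![CDATA[one line]]>']
```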

In conclusion...

In this article, you learned how to use regex flags to improve your regexes. Regex flags are features that can be turned on to allow for things like case-insensitive matching and the ability to add comments to your regex. They allow you to write regexes like this

"(?i)hello world"

instead of regexes like this

"[Hh][Ee][Ll][Ll][Oo] [Ww][Oo][Rr][Ll][Dd]"

My challenge to you:

Create your own regular expressions that use the regex flags that you learned about today: VERBOSE, IGNORECASE, MULTILINE, and DOTALL.

If you enjoyed this post, let me know. Share this with your friends and stay tuned for next week's post. See you then!

(If you spot any errors or typos on this post, contact me via my contact page.)