14  Regular Expressions


Goals


String Manipulation

In the past few decades, modern computers have gotten exponentially more efficient at working with numerical data, in part a side effect of the demands of modern graphics. Libraries like NumPy and polars take direct advantage of these optimizations, and libraries built on top of them, like pandas and shapely, benefit in turn.

However, in many domains a surprising amount of time is spent manipulating strings: consider how fundamental HTTP is, or how often we compare strings, clean them, or extract information from a larger string.

While modern computers have gotten faster (able to perform more tasks per second), this has mostly been a linear improvement.

The challenge stems in part from the fact that, unlike numerical math, where many operations can be parallelized, string manipulation is inherently sequential. It often involves iterating over a string one character at a time, limiting our ability to optimize.

In Python, strings are immutable. Functions do not modify strings, they must return a new string.
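A quick illustration of immutability:

```python
s = "hello"
t = s.upper()  # upper() returns a brand-new string
print(t)       # HELLO
print(s)       # hello -- the original string is unchanged
```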

Here’s what a simplified implementation of str.replace that can only replace a single character might look like:

def replace_one_char(self, from_ch, to_ch):
    # assume the existence of an internal array of characters
    new_chars = []
    for char in self._characters:
        if char == from_ch:
            new_chars.append(to_ch)
        else:
            new_chars.append(char)
    return "".join(new_chars)  # convert to a string

While the actual implementation would be in C, the algorithm itself can’t really be more efficient than this. It will need to both scan the entire string and make a copy of the string, O(N) in both time and space complexity.

O(N) is generally a lower bound on string manipulation. In practice, many string algorithms require multiple scans of the string and/or additional storage; consider the actual str.replace, which must handle runs of characters, not just a single character at a time.

This means we’ll be spending a lot of time between O(N) and O(N²). Some of the more complex string algorithms approach O(N³).

String Methods

Method(s) Description
str.upper(), str.lower() Convert to upper or lower case.
str.isupper(), str.islower() Check if all characters are upper or lower case.
str.strip(), str.lstrip(), str.rstrip() Remove whitespace from both ends, the beginning, or the end of a string.
str.replace() Replace all occurrences of a string with another string.
str.split() Split a string into a list of substrings.
str.startswith(), str.endswith() Check if a string starts or ends with a substring.
substr in str Check if a string contains a substring.
str.count() Count the number of occurrences of a substring.
str.find(), str.rfind() Find the index of the first or last occurrence of a substring; return -1 if not found.
str.index(), str.rindex() Like find()/rfind(), but raise ValueError if the substring is not found.
str.isalpha(), str.isalnum(), str.isdigit(), etc. Check if all characters in a string are alphabetic, alphanumeric, digits, etc.

https://docs.python.org/3/library/string.html
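A few of these methods in action:

```python
s = "  Hello, World!  "
print(s.strip())              # 'Hello, World!'
print(s.lower())              # '  hello, world!  '
print(s.strip().split(", "))  # ['Hello', 'World!']
print(s.count("l"))           # 3
print("World" in s)           # True
print(s.find("World"))        # 9
```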

Common String Manipulation Tasks

Operation Description Time Complexity
Comparing For equality or for sorting O(N)
Searching Finding a substring in a string O(N) to O(N+M), M is length of substring
Replacing Replacing a substring with another substring O(NM), M is number of replacements
Splitting Splitting a string into a list of substrings based on some character(s) O(N)
Validating/Matching Checking if a string is in a particular format O(NM), M is pattern complexity

Regular Expressions

Why Regular Expressions?

Not only are the string operations O(N), we often do far more of them than we realize:

text = "abcdefghijklmnopqrstuvwxyzMARIOabcdefghijklmnopqrstuvwxyzLUIGIabcdef"

# this contains two O(N) scans of the string, lower() and __contains__
if "mario" in text.lower():
    print("found mario!")
found mario!

And if we wanted to try to find two strings:

# now we're at four scans of the string
if "mario" in text.lower():
    print("found mario!")
if "luigi" in text.lower():
    print("found luigi!")
found mario!
found luigi!

We could cut this down by saving text.lower(), but if we wanted to search for lots of strings, we wind up repeating the O(N) operation many times.

searches = ("mario", "luigi", "peach")
lower_text = text.lower()
for search in searches:
    if search in lower_text:
        print(f"found {search}")
found mario
found luigi

We could in theory optimize this by writing code that analyzes one character at a time and maintains detailed state:

# if either hits 5, word is found
mario_matches = 0
luigi_matches = 0

for ch in text:
    if ch == "m":
        mario_matches = 1
        luigi_matches = 0
    elif ch == "a":
        luigi_matches = 0
        if mario_matches == 1:
            mario_matches += 1
        else:
            mario_matches = 0
    elif ch == "i":
        # mar-i
        if mario_matches == 3:
            mario_matches += 1
        else:
            mario_matches = 0
        # lu-i or luig-i
        if luigi_matches == 2:
            luigi_matches += 1
        elif luigi_matches == 4:
            luigi_matches = 5  # FOUND FULL STRING!
    # etc...

Doing this for each variation of strings we’re searching for would be tedious & error-prone. Instead, we will turn to regular expressions.

Regular expressions are a notation for pattern-matching common to many programming languages. While each language has variations, the syntax is more alike than different.

Regular expressions are used in many different contexts:

  • Searching - Find all occurrences of a pattern in a string.
  • Validating - Check if a string matches a pattern. (e.g. phone numbers, email addresses, etc.)
  • Splitting/Extracting - Extract information from a string based on a pattern.

Examples

Pattern Explanation Example Matches
pies? Match the word “pie” or “pies” “pie”, “pies”
c[aou]t Match three-letter words that start with c, end with t, and have a, o, or u in the middle. “cat”, “cot”, “cut”
\d{3}-\d{3}-\d{4} Match a phone number in dashed format. “123-456-7890”
[A-Z][a-z]+, [A-Z]{2} Match a city name in the format “City, ST” “Chicago, IL”, “Detroit, MI”
\d\s*(\w+) Match a number followed by zero or more spaces followed by one or more letters. Capture the letters. 1 apple -> apple, 2 oranges -> oranges, 3bananas -> bananas

We’ll learn how to construct these patterns in the next section, first let’s take a look at how to use them in Python.

Regular Expressions in Python

Python, like many other languages, has a built-in regular expression module.

  • re.findall - Find all occurrences of a pattern in a string.
  • re.finditer - Find all occurrences of a pattern in a string, and return an iterator.
  • re.search - Find the first occurrence of a pattern in a string.
  • re.fullmatch - Check if a string matches a pattern exactly.
  • re.match - Check if a string matches a pattern from the start.
  • re.sub - Replace all occurrences of a pattern in a string with another string.
  • re.split - Split a string into a list of substrings based on a pattern.

Python: re module

Search: re.findall, re.finditer, re.search

re.findall is used to find all occurrences of a pattern in a string and return them all at once in a list.

re.finditer returns a lazy iterator of all matches that’ll let you iterate over them one at a time.

re.search finds the first match in the string.

import re

# find all four letter words
re.findall(r" (\w{4}) ", "The quick brown fox jumps over the lazy dog.")
['over', 'lazy']
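For comparison, re.search and re.finditer applied to the same sentence:

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# re.search returns the first match object (or None if there is no match)
m = re.search(r" (\w{4}) ", text)
print(m.group(1))  # over

# re.finditer yields match objects lazily, one at a time
for m in re.finditer(r" (\w{4}) ", text):
    print(m.group(1))  # over, then lazy
```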

Validation: re.fullmatch, re.match

These functions all take a pattern and a string, and return a match object if the pattern is found in the string.

  • re.fullmatch only matches if the pattern matches the entire string.
  • re.match only matches at the beginning of the string. (Meaning if the pattern is found at the beginning of the string, but the string continues after that, it still counts as a match.)

(re.search is like match but looks anywhere within the string.)

import re

def validate_phone_number(phone_number):
    return re.fullmatch(r"\d{3}-\d{3}-\d{4}", phone_number)

def validate_phone_number_bad(phone_number):
    if len(phone_number) != 12:
        return False
    if phone_number[3] != '-':
        return False
    if phone_number[7] != '-':
        return False
    for i in range(12):
        if i == 3 or i == 7:
            continue
        if not phone_number[i].isdigit():
            return False
    return True
# which will match and which will not?
pattern = r"\d{3}-\d{3}-\d{4}"
functions = [re.fullmatch, re.search, re.match]
strings = ["202-111-5555", "Emily's number is 555-123-4444", "202-111-3300abcdef"]

for f in functions:
    for s in strings:
        # print the name of the function, the string, and the result
        matches = f(pattern, s) is not None
        print(f"{f.__name__:<10} {s:<40} {matches}")
fullmatch  202-111-5555                             True
fullmatch  Emily's number is 555-123-4444           False
fullmatch  202-111-3300abcdef                       False
search     202-111-5555                             True
search     Emily's number is 555-123-4444           True
search     202-111-3300abcdef                       True
match      202-111-5555                             True
match      Emily's number is 555-123-4444           False
match      202-111-3300abcdef                       True

validate_phone_number is much easier to read and understand, and it is much easier to maintain. It also takes less than a second to run 1,000,000 validations.

The naive validate_phone_number_bad takes about twice as long. With large data sets, and dozens of complex validations, the difference can be significant.

re.sub

re.sub is used to replace all occurrences of a pattern in a string.

import re

text = "The quick brown fox jumps over the lazy dog."
text = re.sub(r"\s\w{3}\s", "~", text)
print(text)
The quick brown~jumps over~lazy dog.

re.sub is very fast, and often lets us do things that are much more complicated than a simple string replace could, multiplying the speedup.

re.split

re.split is used to split a string into a list of substrings based on a pattern. This is less commonly used, but can be useful if something like the standard CSV parser can’t handle a particular format.

# split a string apart on a pattern (similar to str.split, but can use regex patterns)
re.split(r"Where", "Wow! Where? I don't know")
['Wow! ', "? I don't know"]
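Pattern-based splitting is more useful when the delimiters vary. For instance, splitting on any run of sentence punctuation (a sketch; note the trailing empty string when the text ends with a delimiter):

```python
import re

text = "Wow! Where? I don't know..."
print(re.split(r"[!?.]+\s*", text))
# ['Wow', 'Where', "I don't know", '']
```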

re.compile

If you are going to use the same pattern multiple times, it is more efficient to compile the pattern into a regular expression object.

For validation for example:

import re

phone_number_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

# phone_number pattern is a compiled regular expression object
print(type(phone_number_pattern))
print(phone_number_pattern)

def validate_phone_number(phone_number):
    # re.Pattern objects have all of the same methods that `re` does,
    # you just omit the pattern argument
    return phone_number_pattern.fullmatch(phone_number)
<class 're.Pattern'>
re.compile('\\d{3}-\\d{3}-\\d{4}')

This means the expensive work of building the regular expression is done once, outside the loop. This leads to a speedup of more than 50% when validating ~1 million numbers with a similar regex.

All of the re. methods can be called on a compiled regular expression object.

For example: re.findall(pattern, text) is the same as pattern.findall(text) on a compiled pattern.

Flags

Python’s regular expression module supports a number of flags that can be passed to the re.compile function or any of the methods.

  • re.IGNORECASE - ignore case when matching
  • re.MULTILINE - treat the string as multiple lines when evaluating certain patterns (e.g. ^ and $)
  • re.DOTALL - allow . to match newlines
  • re.VERBOSE - allow comments and whitespace in the pattern
import re

text = """
The quick brown fox jumps over the lazy dog...
Lorem ipsum dolor sit amet, consectetur adipiscing elit...
But I must explain to you how all this mistaken idea...
Then I saw the storm coming...
"""

pattern = re.compile(r"""
    ^      # start of line
    \w{3}  # 3 letter word
    \s     # whitespace
    \w*    # any number of letters
    \s     # whitespace
    \w*    # include third word
""", re.VERBOSE | re.IGNORECASE | re.MULTILINE) # combine flags with the | operator

pattern.findall(text)
# will return ["The quick brown", "But I must"]
['The quick brown', 'But I must']

Regular Expression Syntax

Regular expressions describe patterns.

If we stick to letters and numbers, they describe an exact match:

This means the regex a will match the string "a" but not "b" or "A". The regex ab will match the string "ab" but not "ba" or "AB" or "a".

We can use | to mean “or” and parenthesis () for grouping:

(c|b|r)at would match "cat", "bat", or "rat".
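For example, using re.fullmatch from earlier to test each candidate:

```python
import re

for word in ["cat", "bat", "rat", "mat"]:
    print(word, re.fullmatch(r"(c|b|r)at", word) is not None)
# cat True, bat True, rat True, mat False
```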

Character Classes

While it’d be possible to write a regex like: (1|2|3|4|5|6|7|8|9|0) to match a single digit, it’s much easier to use a range or character class.

We could instead write [0-9] or \d.

[] - matches any character in the brackets, and allows lexical ranges. Prefixing the contents of the brackets with ^ will instead match any character not in the brackets.

  • [0-9] - matches any digit
  • [^abc] - matches any character except a, b, or c
  • [a-z] - matches any lowercase letter
  • [A-Z] - matches any uppercase letter
  • [^ \n\t] - matches any character except whitespace
  • [a-zA-Z] - matches any letter in either case
  • [aeiou] - matches any vowel

A couple of things to note:

  • There are letters (and numbers!) from non-English languages that fall outside the ranges a-z and A-Z; specifying a-z will not include them.
  • If you need to match a literal character that has a special meaning, like ], prefix it with a backslash: \].

Certain character classes are so common they have their own shortcuts:

  • \d - matches any digit
  • \D - matches any non-digit
  • \w - matches any alphanumeric character
  • \W - matches any non-alphanumeric character
  • \s - matches any whitespace character
  • \S - matches any non-whitespace character
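A quick demonstration of the shortcuts:

```python
import re

text = "Order #42 shipped on 3 May"
print(re.findall(r"\d+", text))  # ['42', '3']
print(re.findall(r"\w+", text))  # ['Order', '42', 'shipped', 'on', '3', 'May']
print(re.findall(r"\S+", text))  # ['Order', '#42', 'shipped', 'on', '3', 'May']
```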

Character Class Practice

Write a character class (or short pattern) for each of the following:

  • Any character that is not a vowel: [^aeiou]
  • All strings made up of letters, numbers, underscores and dashes: [-\w]+ (The - at the beginning is used to add a literal dash.)
  • Any character that is not a digit: [^0-9] or \D
  • A capitalized three-letter word: [A-Z][a-z][a-z]

Quantifiers

Quantifiers are used to specify how many times a pattern should be matched.

These quantifiers can occur after any piece of a pattern:

  • ? - match 0 or 1 times
  • * - match 0 or more times
  • + - match 1 or more times
  • {n} - match exactly n times
  • {n,} - match at least n times
  • {n,m} - match at least n times but no more than m times

If there is no quantifier, the pattern is matched exactly once.

These can be combined with any kind of pattern:

  • a? - matches 0 or 1 a
  • (e|i)* - matches 0 or more “e” or “i”
  • \s+ - matches 1 or more whitespace characters
  • [abc]{3} - matches exactly 3 characters from class abc (aaa, bbb, abc, cbc, bba, etc.)
  • \d{3,} - matches at least 3 digits
  • \d{3,5} - matches at least 3 digits but no more than 5
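A few of these in action with re.fullmatch:

```python
import re

# ? makes the preceding "u" optional
print(re.fullmatch(r"colou?r", "color") is not None)   # True
print(re.fullmatch(r"colou?r", "colour") is not None)  # True

# {3,5} bounds the number of repetitions
print(re.fullmatch(r"\d{3,5}", "1234") is not None)  # True
print(re.fullmatch(r"\d{3,5}", "12") is not None)    # False
```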

By default quantifiers are greedy. This means they will match as many times as possible.

If you want to make an operator non-greedy, you can add a ? after it. This is commonly used with * and + which can otherwise consume too much of the string.

  • a*? - matches 0 or more a but as few as possible based on the rest of the pattern
  • \W+? - matches 1 or more non-alphanumeric characters but as few as possible based on the rest of the pattern
# Note: don't use regex to parse HTML, HTML is too messy in practice to be
# reliably pattern-matched.  Use a parser like `lxml.html`!
html = "<div>one</div> <div>two</div>"
print("greedy match", re.findall("<div>.*</div>", html))
print("non-greedy-match", re.findall("<div>.*?</div>", html))
greedy match ['<div>one</div> <div>two</div>']
non-greedy-match ['<div>one</div>', '<div>two</div>']

Anchors

Anchors are used to match the beginning or end of a string.

  • ^ - matches the beginning of a string
  • $ - matches the end of a string
  • \A - matches the beginning of the string (unlike ^, not affected by MULTILINE mode)
  • \Z - matches the end of the string (unlike $, not affected by MULTILINE mode)
  • \b - matches a word boundary (a special symbol that looks for the boundary between a sequence of alphanumeric characters and a sequence of non-alphanumeric characters)

Word boundaries can be particularly useful because we don’t know if a word is bounded by space or punctuation.

# all strings of letters or apostrophes that are terminated by a word boundary
# that might be a space, comma, hyphen, period, etc.
re.findall(r"['\w]+\b", "This is a list of sometimes-punctuated words, that we don't *want* punctuation from.")
['This',
 'is',
 'a',
 'list',
 'of',
 'sometimes',
 'punctuated',
 'words',
 'that',
 'we',
 "don't",
 'want',
 'punctuation',
 'from']
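A sketch of ^ and $ on a multi-line string: without re.MULTILINE they anchor to the whole string, with it they anchor to each line.

```python
import re

lines = "alpha\nbeta\ngamma"
print(re.findall(r"^\w+", lines))                # ['alpha']
print(re.findall(r"\w+$", lines))                # ['gamma']
print(re.findall(r"^\w+", lines, re.MULTILINE))  # ['alpha', 'beta', 'gamma']
```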

Groups

We’ve seen that parentheses can be used to group patterns together.

This is useful for applying quantifiers to multiple patterns at once. For example:

  • (ab)+ - matches 1 or more ab, but not a or b on their own.
  • (a|an|the)? - optionally matches an, a, or the

It also allows us to refer to the group later in the pattern:

  • (\w+) \1 - matches a word followed by the same word again. For example, the the or dog dog but not the dog or dog the.

\1 refers to the first group, \2 refers to the second group, etc.
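For example, finding doubled words:

```python
import re

text = "it was the the best of of times"
print(re.findall(r"\b(\w+) \1\b", text))  # ['the', 'of']
```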

This can also be used when using the sub method of the re module, for example:

import re

text = "Hello, and welcome to CAPP 30122. Congratulations on being done with CAPP 30121!"

# we replace all 30___ with 99___ numbers
re.sub(r"30(\d+)", r"99\1", text)
'Hello, and welcome to CAPP 99122. Congratulations on being done with CAPP 99121!'

Practical Example

To demonstrate the application of regular expressions, and get a sense of how much of a benefit we might get from using them, let’s consider an example.

Your team is working on a project to analyze court filings.

  • One member of your team is working on OCR (Optical Character Recognition) to convert scanned documents into text files.
  • Another member of your team will be visualizing the data, and they need the counts of ten key terms in each document.
  • Your job, as the newest intern, is to write a function that takes a word and a text file, and returns the number of times that word appears in the text file.

This seems like a simple task. You don’t have real data yet, so you grab a free text file of comparable size from Project Gutenberg and use that.

all_text = open('shakespeare.txt').read()

def count_word(word, text):
    counter = 0
    for w in text.split():
        if w == word:
            counter += 1
    return counter

How long does this take?

import timeit

# tried 100000, took forever... 
# tried 10000, still took forever...
# let's see if 1000 works?
number = 1000
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)

On the machine this was written on, this took 41 seconds to run 1000 searches. Our 100k documents will take 4000 seconds, over an hour!

It seems like we could do better, but an hour is acceptable for now so we move on.

Example 2

During code review it is pointed out you will need to ignore case:

def count_word(word, text):
    counter = 0
    for w in text.split():
        if w.lower() == word.lower():
            counter += 1
    return counter

How long does this take?

timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)

79 seconds, about twice as long.

This would take two hours to run 100,000 documents.

Why did it take twice as long?

There’s one easy optimization that can shave about 40ms/iteration off, what is it?

The calls to lower() are doubling the time.

We can call lower() outside the loop to get a lot of that time back.
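That optimized version might look like this (same behavior as the version above, but lower() now runs once per call instead of twice per word):

```python
def count_word(word, text):
    counter = 0
    word = word.lower()             # lowercase the target once
    for w in text.lower().split():  # ...and the text once
        if w == word:
            counter += 1
    return counter
```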

Example 3

As you wonder what will happen as the corpus of text grows, you hear that there are new requirements:

  • Ignore punctuation
  • Ignore plurals (for our purposes we can just ignore trailing s characters)
  • Sometimes page numbers are showing up in the middle of scans, and we want to ignore those too, so strip all number characters.
def count_word(word, text):
    counter = 0
    # do these outside the loop at least
    word = word.lower()
    text = text.lower()
    for w in text.split():
        # remove all numeric characters that might appear inside words
        w = "".join([c for c in w if c not in '0123456789'])
        # remove leading/trailing punctuation (but not punctuation in the middle)
        w = w.strip('.,!?;:"\'')
        if w == word or w == word + "s":
            counter += 1
    return counter
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)

291 seconds, that is 7x longer. Our total corpus would take 14 hours now.

Each time we add a new requirement we have to iterate over each word in the text. We’re doomed to get slower as the code gets more complex.

Example 4

But what if we could do all of that work in a single pass?

import re

def count_word(word, text):
    # remove digits and punctuation that might appear inside words
    text = re.sub(r'[\d.,!?;:"\']', '', text)
    # count matches of word bounded by non-word characters, with optional trailing s
    return len(re.findall(r"\W" + word + r"s?\W", text, re.IGNORECASE))
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)

100 seconds, a 66% reduction with all the same features. Adding new features will only have marginal cost as well, instead of adding a multiplier as we saw with the standard O(N) string methods.

This would be a real-world improvement of 9 hours off our total run.

Further Exploration

  • Python Regex Howto
  • pythex - Interactive Python regular expression tester & cheat sheet.
  • regex101 - Interactive regular expression tester & cheat sheet. (Be sure to select Python!)
  • pydantic - Data validation and settings management using Python type annotations.