Goals
In the past few decades modern computers have gotten exponentially more efficient at working with numerical data, a side effect of the demands of modern graphics. Libraries like NumPy and polars can take advantage of these optimizations, and libraries like pandas and shapely benefit as a result.
However, in many domains a surprising amount of time is spent manipulating strings: consider how fundamental HTTP is, or how often we find ourselves comparing strings, cleaning them, and extracting information from a larger string.
While modern computers have gotten faster (able to perform more tasks per second) this has mostly been a linear improvement.
The challenge in part stems from the fact that unlike math, where we can parallelize a lot of operations, string manipulation is inherently sequential. It often involves iterating over something one character at a time, limiting our ability to optimize.
In Python, strings are immutable. Functions do not modify strings, they must return a new string.
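As a quick illustration of immutability (the values here are just for demonstration):

```python
s = "mario"
t = s.upper()   # returns a brand-new string
print(t)        # MARIO
print(s)        # mario -- the original string is unchanged
```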
Let’s see what a simplified implementation of str.replace that can only replace a single character might look like:
def replace_one_char(self, from_ch, to_ch):
    # assume the existence of an internal array of characters
    new_chars = []
    for char in self._characters:
        if char == from_ch:
            new_chars.append(to_ch)
        else:
            new_chars.append(char)
    return "".join(new_chars)  # convert to a string
While the actual implementation would be in C, the algorithm itself can’t really be more efficient than this. It will need to both scan the entire string and make a copy of it, as will the real str.replace, which also needs to account for runs of characters, not just a single character at a time. This means we’ll be spending a lot of time in O(N) methods like these:
Method(s) | Description
---|---
str.upper(), str.lower() | Convert to upper or lower case.
str.isupper(), str.islower() | Check if all characters are upper or lower case.
str.strip(), str.lstrip(), str.rstrip() | Remove white space from the beginning or end of a string.
str.replace() | Replace all occurrences of a string with another string.
str.split() | Split a string into a list of substrings.
str.startswith(), str.endswith() | Check if a string starts or ends with a substring.
substr in str | Check if a string contains a substring.
str.count() | Count the number of occurrences of a substring.
str.find(), str.rfind() | Find the index of the first or last occurrence of a substring (returning -1 if absent).
str.index(), str.rindex() | Like find/rfind, but raise ValueError if the substring is absent.
str.isalpha(), str.isalnum(), str.isdigit(), etc. | Check if all characters in a string are alphabetic, alphanumeric, digits, etc.
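A few of these methods in action (the sample string here is just for illustration):

```python
s = "  Hello, World!  "
print(s.strip())                      # "Hello, World!"
print(s.lower().count("l"))           # 3
print("Hello, World!".split(", "))    # ['Hello', 'World!']
print(s.strip().startswith("Hello"))  # True
```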
Operation | Description | Time Complexity
---|---|---
Comparing | For equality or for sorting | O(N)
Searching | Finding a substring in a string | O(N)
Replacing | Replacing a substring with another substring | O(N)
Splitting | Splitting a string into a list of substrings based on some character(s) | O(N)
Validating/Matching | Checking if a string is in a particular format | O(N)
Not only are these string operations all O(N) scans, we often end up repeating them:

text = "abcdefghijklmnopqrstuvwxyzMARIOabcdefghijklmnopqrstuvwxyzLUIGIabcdef"

# this contains two O(N) scans of the string, lower() and __contains__
if "mario" in text.lower():
    print("found mario!")
found mario!
And if we wanted to try to find two strings:
# now we're at four scans of the string
if "mario" in text.lower():
print("found mario!")
if "luigi" in text.lower():
print("found luigi!")
found mario!
found luigi!
We could cut this down by saving text.lower(), but if we wanted to search for lots of strings, we wind up repeating the scan for each one:
searches = ("mario", "luigi", "peach")
lower_text = text.lower()
for search in searches:
    if search in lower_text:
        print(f"found {search}")
found mario
found luigi
We could in theory optimize this. We would need to write code that analyzed one character at a time and maintained detailed state:
# if either hits 5, word is found
mario_matches = 0
luigi_matches = 0

for ch in text:
    if ch == "m":
        mario_matches = 1
        luigi_matches = 0
    elif ch == "a":
        luigi_matches = 0
        if mario_matches == 1:
            mario_matches += 1
        else:
            mario_matches = 0
    elif ch == "i":
        # mar-i
        if mario_matches == 3:
            mario_matches += 1
        else:
            mario_matches = 0
        # lu-i or luig-i
        if luigi_matches == 2:
            luigi_matches += 1
        elif luigi_matches == 4:
            luigi_matches = 5  # FOUND FULL STRING!
    # etc...
Doing this for each variation of strings we’re searching for would be tedious & error-prone. Instead, we will turn to regular expressions.
Regular expressions are a notation for pattern-matching common to many programming languages. While each language has variations, the syntax is more alike than different.
Regular expressions are used in many different contexts:
Pattern | Explanation | Example Matches
---|---|---
pies? | Match the word “pie” or “pies” | “pie”, “pies”
c[aou]t | Match words that start and end with c & t, and have a/o/u in the middle. | “cat”, “cot”, “cut”
\d{3}-\d{3}-\d{4} | Match a phone number in dashed format. | “123-456-7890”
[A-Z][a-z]+, [A-Z]{2} | Match a city name in the format “City, ST” | “Chicago, IL”, “Detroit, MI”
\d\s*(\w+) | Match a number followed by zero or more spaces followed by one or more letters. Capture the letters. | 1 apple -> apple, 2 oranges -> oranges, 3bananas -> bananas
We’ll learn how to construct these patterns in the next section; first, let’s take a look at how to use them in Python.
Python, like many other languages, has a built-in regular expression module.
re.findall - Find all occurrences of a pattern in a string.
re.finditer - Find all occurrences of a pattern in a string, and return an iterator.
re.search - Find the first occurrence of a pattern in a string.
re.fullmatch - Check if a string matches a pattern exactly.
re.match - Check if a string matches a pattern from the start.
re.sub - Replace all occurrences of a pattern in a string with another string.
re.split - Split a string into a list of substrings based on a pattern.

re.findall, re.finditer, re.search

re.findall is used to find all occurrences of a pattern in a string and return them all at once in a list.
re.finditer returns a lazy iterator of all matches that’ll let you iterate over them one at a time.
re.search finds the first match in the string.
import re
# find all four letter words
re.findall(r" (\w{4}) ", "The quick brown fox jumps over the lazy dog.")
['over', 'lazy']
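For comparison, re.finditer and re.search return match objects rather than strings; here is a small sketch using the same sentence:

```python
import re

sentence = "The quick brown fox jumps over the lazy dog."

# finditer yields match objects lazily, one at a time
for m in re.finditer(r" (\w{4}) ", sentence):
    print(m.group(1), m.start())

# search returns only the first match
print(re.search(r" (\w{4}) ", sentence).group(1))  # over
```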
re.fullmatch, re.match

These functions all take a pattern and a string, and return a match object if the pattern is found in the string.

re.fullmatch only matches if the pattern matches the entire string.
re.match only matches at the beginning of the string. (Meaning if the pattern is found at the beginning of the string, but the string continues after that, it still counts as a match.)
(re.search is like match but looks anywhere within the string.)
import re
def validate_phone_number(phone_number):
return re.fullmatch(r"\d{3}-\d{3}-\d{4}", phone_number)
def validate_phone_number_bad(phone_number):
if len(phone_number) != 12:
return False
if phone_number[3] != '-':
return False
if phone_number[7] != '-':
return False
for i in range(12):
if i == 3 or i == 7:
continue
if not phone_number[i].isdigit():
return False
return True
# which will match and which will not?
pattern = r"\d{3}-\d{3}-\d{4}"
functions = [re.fullmatch, re.search, re.match]
strings = ["202-111-5555", "Emily's number is 555-123-4444", "202-111-3300abcdef"]

for f in functions:
    for s in strings:
        # print the name of the function, the string, and the result
        matches = f(pattern, s) is not None
        print(f"{f.__name__:<10} {s:<40} {matches}")
fullmatch 202-111-5555 True
fullmatch Emily's number is 555-123-4444 False
fullmatch 202-111-3300abcdef False
search 202-111-5555 True
search Emily's number is 555-123-4444 True
search 202-111-3300abcdef True
match 202-111-5555 True
match Emily's number is 555-123-4444 False
match 202-111-3300abcdef True
validate_phone_number is much easier to read and understand, and it is much easier to maintain. It also takes less than a second to run 1,000,000 validations. The naive validate_phone_number_bad takes about twice as long. With large data sets, and dozens of complex validations, the difference can be significant.
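One way to check this claim yourself is to confirm the two approaches agree, then time them. This is a sketch; the function names and sample inputs below are our own, not from the text, and exact timings will vary by machine:

```python
import re
import timeit

PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")

def validate_re(phone_number):
    return PHONE_RE.fullmatch(phone_number) is not None

def validate_manual(phone_number):
    if len(phone_number) != 12:
        return False
    if phone_number[3] != "-" or phone_number[7] != "-":
        return False
    return all(phone_number[i].isdigit() for i in range(12) if i not in (3, 7))

# the two validators should agree on every input
for s in ["202-111-5555", "202-111-555", "abc-def-ghij", "2021115555ab"]:
    assert validate_re(s) == validate_manual(s)

# exact numbers depend on your machine
print(timeit.timeit(lambda: validate_re("202-111-5555"), number=100_000))
print(timeit.timeit(lambda: validate_manual("202-111-5555"), number=100_000))
```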
re.sub

re.sub is used to replace all occurrences of a pattern in a string.
import re

text = "The quick brown fox jumps over the lazy dog."
text = re.sub(r"\s\w{3}\s", "~", text)
print(text)
The quick brown~jumps over~lazy dog.
re.sub is very fast, and often lets us do things that are much more complicated than a simple string replace could, multiplying the speedup.
re.split

re.split is used to split a string into a list of substrings based on a pattern. This is less commonly used, but can be useful if something like the standard CSV parser can’t handle a particular format.
# split string apart on punctuation (similar to str.split but can use patterns)
re.split(r"Where", "Wow! Where? I don't know")
['Wow! ', "? I don't know"]
re.compile
If you are going to use the same pattern multiple times, it is more efficient to compile the pattern into a regular expression object.
For validation, for example:
import re

phone_number_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

# phone_number_pattern is a compiled regular expression object
print(type(phone_number_pattern))
print(phone_number_pattern)

def validate_phone_number(phone_number):
    # re.Pattern objects have all of the same methods that `re` does,
    # you just omit the pattern argument
    return phone_number_pattern.fullmatch(phone_number)
<class 're.Pattern'>
re.compile('\\d{3}-\\d{3}-\\d{4}')
This means the expensive work of building the regular expression is now done outside the loop. This leads to a speedup of more than 50% when used to validate ~1 million numbers using a similar regex.
All of the re. methods can be called on a compiled regular expression object. For example: re.findall(pattern, text) is the same as pattern.findall(text) on a compiled pattern.
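A quick sketch confirming the equivalence (the sample text is our own):

```python
import re

text = "call 202-111-5555 or 312-555-0000"
pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

# module-level function and compiled-pattern method give the same result
assert re.findall(r"\d{3}-\d{3}-\d{4}", text) == pattern.findall(text)
print(pattern.findall(text))  # ['202-111-5555', '312-555-0000']
```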
Python’s regular expression module supports a number of flags that can be passed to the re.compile function or any of the methods.

re.IGNORECASE - ignore case when matching
re.MULTILINE - treat the string as multiple lines when evaluating certain patterns (e.g. ^ and $)
re.DOTALL - allow . to match newlines
re.VERBOSE - allow comments and whitespace in the pattern

import re

text = """
The quick brown fox jumps over the lazy dog...
Lorem ipsum dolor sit amet, consectetur adipiscing elit...
But I must explain to you how all this mistaken idea...
Then I saw the storm coming...
"""

pattern = re.compile(r"""
^      # start of line
\w{3}  # 3 letter word
\s     # whitespace
\w*    # any number of letters
\s     # whitespace
\w*    # include third word
""", re.VERBOSE | re.IGNORECASE | re.MULTILINE)  # combine flags with the | operator

pattern.findall(text)  # will return ["The quick brown", "But I must"]
['The quick brown', 'But I must']
Regular expressions describe patterns.
If we stick to letters and numbers, they describe an exact match:
This means the regex a will match the string "a" but not "b" or "A". The regex ab will match the string "ab" but not "ba" or "AB" or "a".
We can use | to mean “or” and parentheses () for grouping: (c|b|r)at would match "cat", "bat", or "rat".
While it’d be possible to write a regex like (1|2|3|4|5|6|7|8|9|0) to match a single digit, it’s much easier to use a range or character class. We could instead write [0-9] or \d.
[] - matches any character in the brackets, and allows lexical ranges. Prefixing the character class with a ^ will match any character not in the brackets.

[0-9] - matches any digit
[^abc] - matches any character except a, b, or c
[a-z] - matches any lowercase letter
[A-Z] - matches any uppercase letter
[^ \n\t] - matches any character except whitespace
[a-zA-Z] - matches any letter in either case
[aeiou] - matches any vowel

A couple of things to note:

If you need to use a character inside [] that has a special meaning, you prefix it with a backslash.

Certain character classes are so common they have their own shortcuts:
\d - matches any digit
\D - matches any non-digit
\w - matches any alphanumeric character
\W - matches any non-alphanumeric character
\s - matches any whitespace character
\S - matches any non-whitespace character

Some examples:

[^aeiou] - any character that is not a vowel.
[-_a-zA-Z0-9] - letters, numbers, underscores, and dashes. (The - at the beginning is used to add a literal dash.)
[^0-9] or \D - any non-digit.
[A-Z][a-z][a-z] - a capital letter followed by two lowercase letters.
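A few of these classes and shortcuts in action (the sample string is our own):

```python
import re

s = "Agent 007, a.k.a. Bond"
print(re.findall(r"\d+", s))    # ['007']
print(re.findall(r"\w+", s))    # ['Agent', '007', 'a', 'k', 'a', 'Bond']
print(re.findall(r"[A-Z]", s))  # ['A', 'B']
```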
Quantifiers are used to specify how many times a pattern should be matched.
These quantifiers can occur after any piece of a pattern:
? - match 0 or 1 times
* - match 0 or more times
+ - match 1 or more times
{n} - match exactly n times
{n,} - match at least n times
{n,m} - match at least n times but no more than m times

If there is no quantifier, the pattern is matched exactly once.
These can be combined with any kind of pattern:
a? - matches 0 or 1 a
(e|i)* - matches 0 or more “e” or “i”
\s+ - matches 1 or more whitespace characters
[abc]{3} - matches exactly 3 characters from the class abc (aaa, bbb, abc, cbc, bba, etc.)
\d{3,} - matches at least 3 digits
\d{3,5} - matches at least 3 digits but no more than 5

By default quantifiers are greedy. This means they will match as many times as possible.
If you want to make an operator non-greedy, you can add a ? after it. This is commonly used with * and +, which can otherwise consume too much of the string.
a*? - matches 0 or more a but as few as possible based on the rest of the pattern
\W+? - matches 1 or more non-alphanumeric characters but as few as possible based on the rest of the pattern

# Note: don't use regex to parse HTML, HTML is too messy in practice to be
# reliably pattern-matched. Use a parser like `lxml.html`!
html = "<div>one</div> <div>two</div>"
print("greedy match", re.findall("<div>.*</div>", html))
print("non-greedy-match", re.findall("<div>.*?</div>", html))
greedy match ['<div>one</div> <div>two</div>']
non-greedy-match ['<div>one</div>', '<div>two</div>']
Anchors are used to match the beginning or end of a string.
^ - matches the beginning of the string (or of each line in MULTILINE mode)
$ - matches the end of the string (or of each line in MULTILINE mode)
\A - matches the beginning of the string (like ^, but unaffected by MULTILINE mode)
\Z - matches the end of the string (like $, but unaffected by MULTILINE mode)
\b - matches a word boundary (a special symbol that looks for the boundary between a sequence of alphanumeric characters and a sequence of non-alphanumeric characters)

Word boundaries can be particularly useful because we don’t know if a word is bounded by space or punctuation.
# all strings of letters or apostrophes that are terminated by a word boundary
# that might be a space, comma, hyphen, period, etc.
re.findall(r"['\w]+\b", "This is a list of sometimes-punctuated words, that we don't *want* punctuation from.")
['This',
'is',
'a',
'list',
'of',
'sometimes',
'punctuated',
'words',
'that',
'we',
"don't",
'want',
'punctuation',
'from']
We’ve seen that parentheses can be used to group patterns together.
This is useful for applying quantifiers to multiple patterns at once. For example:
(ab)+ - matches 1 or more ab, but not a or b on their own.
(a|an|the)? - optionally matches a, an, or the

It also allows us to refer to the group later in the pattern:

(\w+) \1 - matches a word followed by the same word again. For example, "the the" or "dog dog" but not "the dog" or "dog the".

\1 refers to the first group, \2 refers to the second group, etc.
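A short sketch of a backreference in action (the sample sentence is our own):

```python
import re

# \1 must repeat exactly what the first group captured
m = re.search(r"\b(\w+) \1\b", "it was the the best of times")
print(m.group())   # 'the the'
print(m.group(1))  # 'the'
```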
This can also be used with the sub method of the re module, for example:
import re

text = "Hello, and welcome to CAPP 30122. Congratulations on being done with CAPP 30121!"

# we replace all 30___ with 99___ numbers
re.sub(r"30(\d+)", r"99\1", text)
'Hello, and welcome to CAPP 99122. Congratulations on being done with CAPP 99121!'
To demonstrate the application of regular expressions, and get a sense of how much of a benefit we might get from using them, let’s consider an example.
Your team is working on a project to analyze court filings.
This seems like a simple task. You don’t have real data yet so we’ll take a free text file of comparable size from Project Gutenberg, and use that.
all_text = open('shakespeare.txt').read()

def count_word(word, text):
    counter = 0
    for w in text.split():
        if w == word:
            counter += 1
    return counter
How long does this take?
import timeit

# tried 100000, took forever...
# tried 10000, still took forever...
# let's see if 1000 works?
number = 1000
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)
On the machine this was written on, this took 41 seconds to run 1000 searches. Our 100k documents will take 4000 seconds, over an hour!
It seems like we could do better, but an hour is acceptable for now so we move on.
During code review it is pointed out you will need to ignore case:
def count_word(word, text):
    counter = 0
    for w in text.split():
        if w.lower() == word.lower():
            counter += 1
    return counter
How long does this take?
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)
79 seconds, about twice as long.
This would take two hours to run 100,000 documents.
Why did it take twice as long?
There’s one easy optimization that can shave about 40ms/iteration off, what is it?
The calls to lower() are doubling the time. We can call word.lower() once, outside the loop, to get a lot of that time back.
As you wonder what will happen as the corpus of text grows, you hear that there are new requirements:
def count_word(word, text):
    counter = 0
    # do this outside the loop at least
    word = word.lower()
    text = text.lower()
    for w in text.split():
        # remove all numeric characters that might appear inside words
        w = "".join([c for c in w if c not in '0123456789'])
        # remove leading/trailing punctuation (but not punctuation in the middle)
        w = w.strip('.,!?;:"\'')
        if w == word or w == word + "s":
            counter += 1
    return counter
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)
291 seconds, that is 7x longer. Our total corpus would take 14 hours now.
Each time we add a new requirement we have to iterate over each word in the text. We’re doomed to get slower as the code gets more complex.
But what if we could do all of that work in a single pass?
import re

def count_word(word, text):
    # remove all digits and punctuation that might appear in or around words
    text = re.sub(r'[\d.,!?;:"\']', '', text)
    # return number of matches of word separated by "word boundaries" with an optional trailing s
    return len(re.findall(r"\W" + word + r"s?\W", text, re.IGNORECASE))
timeit.timeit("count_word('Romeo', all_text)", globals=globals(), number=number)
100 seconds, a 66% reduction with all the same features. Adding new features will only have a marginal cost as well, instead of multiplying the runtime as we saw with the string-method approach.
This would be a real-world improvement of 9 hours off our total run.