Regular Expressions

Raw Strings

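Backslashes are used to escape characters in strings. In the path below, \n is read as a newline, while \p is not a recognized escape sequence, so its backslash is kept (recent Python versions warn about this):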

>>> file_name = "C:\projects\nathan"
>>> file_name
'C:\\projects\nathan'
>>> print(file_name)
C:\projects
athan

If we want to use a literal backslash, we need to escape it by putting two backslashes instead:

>>> file_name = "C:\\projects\\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan

We can avoid doubling every backslash by using “raw” strings, which treat backslashes literally. You can make one by prefixing your string with an r character:

>>> file_name = r"C:\projects\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan
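
As a quick check of the two spellings above, this small sketch compares the escaped and raw forms of the same path. The variable names are only illustrative.

```python
# Both spellings describe the same 18-character string.
escaped = "C:\\projects\\nathan"
raw = r"C:\projects\nathan"

same_text = escaped == raw   # True: only the source spelling differs
length = len(raw)            # each backslash counts as one character
print(same_text, length)
```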

search

The regular expression module (re) includes helper tools that allow us to use regular expressions to work with strings.

We can search for one string in another string with the search function.

>>> import re
>>> greeting = "hello world"
>>> re.search("hello", greeting)
<re.Match object; span=(0, 5), match='hello'>
>>> match = re.search("hello", greeting)
>>> help(match)  

>>> match.start()
0
>>> match.end()
5
>>> match.group()
'hello'
>>> greeting[match.start():match.end()]
'hello'

That example isn’t particularly interesting because we could do that with the string find method.
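
To make the comparison with find concrete, here is a minimal sketch that runs both on the same string. One practical difference is the "not found" result: find returns -1, while search returns None, so a match should be checked before calling its methods.

```python
import re

greeting = "hello world"

# str.find returns an index, or -1 when the substring is absent.
index = greeting.find("world")

# re.search returns a match object, or None when nothing matches.
found = re.search("world", greeting)
missing = re.search("goodbye", greeting)

# Guard against None before calling methods on the match.
start = found.start() if found else -1
print(index, start, missing)
```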

We can use \w (the w must be lowercase) to search for “word” characters (letters, digits, or underscore):

>>> match = re.search(r'\w', greeting)
>>> match
<re.Match object; span=(0, 1), match='h'>
>>> match.group()
'h'

Notice that we are using raw strings here so we don’t need to escape our backslashes.

The * command will cause the previous command to be matched zero or more times, matching as many times as possible. Since spaces are not word characters, this match stops just before the first space character.

>>> match = re.search(r'\w*', greeting)
>>> match.group()
'hello'

You can find more information on the regular expression syntax in the re module documentation.

findall

The findall function works similarly to search, except that instead of a match object it returns a list of every match in a string.

>>> match = re.findall(r'\w*', greeting)
>>> match
['hello', '', 'world', '']

Note that the space and end of string result in empty strings in this list because we’re searching for zero or more word characters, and they do match zero word characters.

The + command will cause the previous command to be matched one or more times. This removes the empty strings, since neither the space nor the end of the string contains one or more word characters.

>>> re.findall(r'\w+', greeting)
['hello', 'world']
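
The difference between * and + can be sketched on another string. The sample text here is made up for illustration; it also shows that \w covers digits and underscores as well as letters.

```python
import re

text = "user_42 logged in"  # hypothetical sample text

# Zero or more: also produces empty matches between words.
star = re.findall(r'\w*', text)

# One or more: only non-empty runs of word characters.
plus = re.findall(r'\w+', text)

# Dropping the empty strings from the * result leaves the + result.
non_empty = [word for word in star if word]
print(plus)
```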

More regex syntax

The . command will match any character except a newline:

>>> re.findall(r'o.', greeting)
['o ', 'or']
>>> re.findall(r'.l.', greeting)
['ell', 'rld']

If we want to match a literal period character we will need to use a backslash to escape the period:

>>> re.findall(r'\.', "hi.")
['.']

We can use square brackets to match any one character from a set:

>>> re.findall(r'[aeiou]', greeting)
['e', 'o', 'o']

We can also use a dash to match characters within ranges of characters:

>>> re.findall(r'[a-z]', "Hi there")
['i', 't', 'h', 'e', 'r', 'e']
>>> re.findall(r'[A-Za-z]', "Hi there")
['H', 'i', 't', 'h', 'e', 'r', 'e']
>>> re.findall(r'[0-9]', "Hi there")
[]

The caret character can be used at the beginning of a character set to denote negation. This will match anything except for lowercase letters:

>>> re.findall(r'[^a-z]', "Hi there")
['H', ' ']
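
Character sets combine with the + command from earlier. The following sketch, using a made-up sample string, pulls out runs of digits and then runs of everything that is neither a digit nor a space.

```python
import re

text = "Room 101, floor 3"  # hypothetical sample text

# One or more consecutive digits.
numbers = re.findall(r'[0-9]+', text)

# One or more characters that are neither digits nor spaces.
others = re.findall(r'[^0-9 ]+', text)

print(numbers, others)
```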

split

Let’s split a string by words:

>>> sentence = "Oh what a day, what a lovely day!"
>>> from collections import Counter
>>> sentence.split()
['Oh', 'what', 'a', 'day,', 'what', 'a', 'lovely', 'day!']
>>> Counter(sentence.split())
Counter({'what': 2, 'a': 2, 'Oh': 1, 'day,': 1, 'lovely': 1, 'day!': 1})

The split method on strings splits based on whitespace characters.

With the split function in the re module, we can split a string by a regular expression.

Let’s use \W+ to split based on one or more “non-word” characters:

>>> re.split(r'\W+', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day', '']
>>> Counter(re.split(r'\W+', sentence))
Counter({'what': 2, 'a': 2, 'day': 2, 'Oh': 1, 'lovely': 1, '': 1})
>>> Counter(filter(None, re.split(r'\W+', sentence)))
Counter({'what': 2, 'a': 2, 'day': 2, 'Oh': 1, 'lovely': 1})

We could accomplish nearly the same thing by using findall to find all words.

>>> re.findall(r'\w+', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day']
>>> Counter(re.findall(r'\w+', sentence))
Counter({'what': 2, 'a': 2, 'day': 2, 'Oh': 1, 'lovely': 1})
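
The findall approach can be wrapped in a small helper. This is only a sketch: count_words is a hypothetical name, and lowercasing the text first is an added assumption so that “Oh” and “oh” would be counted together.

```python
import re
from collections import Counter


def count_words(text):
    """Count words case-insensitively, ignoring punctuation."""
    # Lowercase first so capitalization does not split the counts.
    return Counter(re.findall(r'\w+', text.lower()))


counts = count_words("Oh what a day, what a lovely day!")
print(counts)
```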

match

A caret can be put at the beginning of a regular expression to denote that it should only match starting at the beginning of the target string. Similarly, a dollar sign can be put at the end of a regular expression to denote that it should only match ending at the end of the target string.

>>> re.search(r'^wo', "hello world")
>>> re.search(r'^wo', "world")
<re.Match object; span=(0, 2), match='wo'>
>>> re.search(r'^wo$', "world")
>>> re.search(r'^wo$', "wo")
<re.Match object; span=(0, 2), match='wo'>

The match function behaves like search with a leading caret: it requires the pattern to match starting at the beginning of the string.

>>> re.match('what', sentence)
>>> re.match('.*what', sentence)
<re.Match object; span=(0, 19), match='Oh what a day, what'>
>>> re.match('Oh', sentence)
<re.Match object; span=(0, 2), match='Oh'>

The fullmatch function behaves like search with both a caret and a dollar sign: it requires the pattern to match the entire string.

>>> re.fullmatch(r'what', sentence)
>>> re.fullmatch(r'.*what', sentence)
>>> re.fullmatch(r'.*what.*', sentence)
<re.Match object; span=(0, 33), match='Oh what a day, what a lovely day!'>
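
The relationship between search, match, and fullmatch can be checked side by side. This is a small sketch using the same sentence; the boolean variable names are only illustrative.

```python
import re

sentence = "Oh what a day, what a lovely day!"

# match and search-with-caret agree on whether the string starts with "Oh".
starts_with_oh = re.match(r'Oh', sentence) is not None
caret_search = re.search(r'^Oh', sentence) is not None

# fullmatch must consume the whole string, including the final "!".
whole = re.fullmatch(r'.*day!', sentence) is not None
partial = re.fullmatch(r'.*day', sentence) is not None

print(starts_with_oh, caret_search, whole, partial)
```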

sub

You can use the sub function to replace parts of a string.

Let’s replace all vowels with the letter “x”:

>>> re.sub(r'[aeiou]', r"x", greeting)
'hxllx wxrld'

We can use parentheses to group parts of regular expressions. Groups can be referenced by their number in the replacement string:

>>> re.sub(r'([aeiou])', r"x\1", greeting)
'hxellxo wxorld'
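
With more than one group, each is numbered by the position of its opening parenthesis. As a minimal sketch, this swaps the two words of the greeting by referencing the groups in reverse order.

```python
import re

greeting = "hello world"

# \1 is the first word and \2 is the second; emit them swapped.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', greeting)
print(swapped)  # world hello
```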

We can use \d to match a digit character and \D to match a non-digit character. Curly braces can be used with a number inside to repeat the last command a certain number of times.

Let’s make a regular expression to normalize phone numbers:

>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "(202) 456-1111")
'202-456-1111'
>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "202 - 456 - 1111")
'202-456-1111'
>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "202.456.1111")
'202-456-1111'

compile

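If we need to use the same regular expression multiple times throughout our program, we can pre-compile it. The re module already caches recently used patterns, so the speedup is usually small, but a compiled pattern gives the expression a name and saves us from repeating it.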

>>> phone_re = re.compile(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*')
>>> phone_re.sub(r"(\1) \2 - \3", "202.456.1111")
'(202) 456 - 1111'
>>> phone_re.sub(r"\1-\2-\3", "202.456.1111")
'202-456-1111'
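
A compiled pattern is convenient to share between calls. The sketch below reuses the phone pattern inside a helper; PHONE_RE and normalize_phone are hypothetical names, not part of the re module.

```python
import re

# Compiled once at import time and reused by every call.
PHONE_RE = re.compile(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*')


def normalize_phone(number):
    """Return the number as ddd-ddd-dddd; text with no match is unchanged."""
    return PHONE_RE.sub(r'\1-\2-\3', number)


numbers = ["(202) 456-1111", "202 - 456 - 1111", "202.456.1111"]
normalized = [normalize_phone(n) for n in numbers]
print(normalized)
```

Note that sub returns the original string when the pattern does not match, so input without a phone number passes through untouched.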

Grouping

So far our regular expressions have mostly consisted of commands that match individual characters or allow repetition of individual character matches.

What if we want to act on a group?

For example what if we want to match US ZIP codes in their shortened form or their full form?

Here’s a regular expression for matching shortened ZIP codes:

>>> re.search(r'^\d{5}$', '90210')
<re.Match object; span=(0, 5), match='90210'>

A full ZIP code match looks like this:

>>> re.search(r'^\d{5}-\d{4}$', '90210-4873')
<re.Match object; span=(0, 10), match='90210-4873'>

So far we haven’t seen a way to make that last part optional.

We could try putting a question mark after the - and the repetition:

>>> re.search(r'^\d{5}-?\d{4}?$', '90210-4873')
<re.Match object; span=(0, 10), match='90210-4873'>
>>> re.search(r'^\d{5}-?\d{4}?$', '902104873')
<re.Match object; span=(0, 9), match='902104873'>
>>> re.search(r'^\d{5}-?\d{4}?$', '90210-')

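That matches strange things though, such as 902104873, and it no longer matches the plain 5-digit form. The ? after the repetition count does not make the digits optional: a question mark after a repetition makes that repetition non-greedy, and since {4} always matches exactly four digits, it changes nothing here.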

To optionally match a number of consecutive character patterns, we can use a group:

>>> re.search(r'^\d{5}(-\d{4})?$', '90210-4873')
<re.Match object; span=(0, 10), match='90210-4873'>
>>> re.search(r'^\d{5}(-\d{4})?$', '90210')
<re.Match object; span=(0, 5), match='90210'>
>>> re.search(r'^\d{5}(-\d{4})?$', '902104873')
>>> re.search(r'^\d{5}(-\d{4})?$', '90210-')

This allows us to match 5 digits followed optionally by a dash and 4 digits (both the dash and 4 digits must be present).
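
The optional group can be packaged as a validator. This is a sketch under stated assumptions: ZIP_RE and is_zip_code are hypothetical names, and fullmatch is used in place of the caret and dollar sign.

```python
import re

# Five digits, optionally followed by a dash and four more digits.
ZIP_RE = re.compile(r'\d{5}(-\d{4})?')


def is_zip_code(text):
    """Return True if text is a 5-digit or 5+4-digit US ZIP code."""
    return ZIP_RE.fullmatch(text) is not None


for candidate in ['90210', '90210-4873', '902104873', '90210-']:
    print(candidate, is_zip_code(candidate))
```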

Capture Groups

We’ve already talked about using groups to allow for quantifying a group of character patterns.

There’s actually another purpose for groups though.

Groups also allow capturing characters matched by a group.

Remember how we used the group method to access the matched data? We can pass arguments to that method to access captured groups.

For example, in our ZIP code regular expression, we can get the first matching group like this:

>>> m = re.search(r'^\d{5}(-\d{4})?$', '90210-4873')
>>> m.group(1)
'-4873'

>>> m.group()
'90210-4873'

If we want to always access just the first 5 digits, we could put those in a group:

>>> m = re.search(r'(^\d{5})(-\d{4})?$', '90210-4873')
>>> m.group(2)
'-4873'
>>> m.group(1)
'90210'

Note that if we access the 0 group that will give us the entire match, just like when we pass no arguments:

>>> m.group(0)
'90210-4873'
>>> m.group()
'90210-4873'
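
One detail worth knowing when a group is optional: if the group did not take part in the match, group returns None for it. The sketch below shows this with a hypothetical helper named split_zip.

```python
import re


def split_zip(text):
    """Return (five_digits, extension) or None if text is not a ZIP code.

    extension includes the leading dash, and is None when absent.
    """
    m = re.search(r'^(\d{5})(-\d{4})?$', text)
    if m is None:
        return None
    return m.group(1), m.group(2)


full = split_zip('90210-4873')
short = split_zip('90210')
invalid = split_zip('9021')
print(full, short, invalid)
```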

Regular Expression Exercises

Count Numbers

This is the count_numbers exercise in regexes.py.

Edit the function count_numbers so that it returns a count of each number in a given string.

Hint

You can match a number by using a regular expression that matches one or more consecutive digits: \d+

>>> from regexes import count_numbers
>>> count_numbers(declaration)
{'4': 1, '1776': 1}
>>> count_numbers("Why was 6 afraid of 7? Because 7 8 9.")
{'7': 2, '9': 1, '6': 1, '8': 1}

Get File Extension

This is the get_extension exercise in regexes.py.

Edit the function get_extension so that it accepts a full file path and returns the file extension.

Example usage:

>>> from regexes import get_extension
>>> get_extension('archive.zip')
'zip'
>>> get_extension('image.jpeg')
'jpeg'
>>> get_extension('index.xhtml')
'xhtml'
>>> get_extension('archive.tar.gz')
'gz'

Normalize JPEG Extension

This is the normalize_jpeg exercise in regexes.py.

Edit the function normalize_jpeg so that it accepts a JPEG filename and returns a new filename with the extension normalized to lowercase jpg (without an e).

Hint

Look up how to pass flags to the re.sub function.

Example usage:

>>> from regexes import normalize_jpeg
>>> normalize_jpeg('avatar.jpeg')
'avatar.jpg'
>>> normalize_jpeg('Avatar.JPEG')
'Avatar.jpg'
>>> normalize_jpeg('AVATAR.Jpg')
'AVATAR.jpg'

Count Punctuation

This is the count_punctuation exercise in regexes.py.

Edit the function count_punctuation so that it takes a string and returns a count of each punctuation character in the string.

Punctuation characters are characters which are not word characters and are not whitespace characters.

Hint

You can match punctuation characters with this regular expression: [^\s\w]

>>> from regexes import count_punctuation
>>> count_punctuation("^_^ hello there! @_@")
{'^': 2, '@': 2, '!': 1}
>>> count_punctuation(declaration)
{',': 122, '.': 36, ':': 10, ';': 9, '-': 4, '—': 1, '’': 1}

Normalize Whitespace

This is the normalize_whitespace exercise in regexes.py.

Edit the function normalize_whitespace so that it replaces every run of one or more whitespace characters with a single space.

Example usage:

>>> from regexes import normalize_whitespace
>>> normalize_whitespace("hello  there")
'hello there'
>>> normalize_whitespace("""Hold fast to dreams
... For if dreams die
... Life is a broken-winged bird
... That cannot fly.
...
... Hold fast to dreams
... For when dreams go
... Life is a barren field
... Frozen with snow.""")
'Hold fast to dreams For if dreams die Life is a broken-winged bird That cannot fly. Hold fast to dreams For when dreams go Life is a barren field Frozen with snow.'

Hex Colors

This is the is_hex_color exercise in regexes.py.

Edit the function is_hex_color to match hexadecimal color codes. Hex color codes consist of an octothorpe symbol (#) followed by either 3 or 6 hexadecimal digits (that’s 0 to 9 or a to f, in either uppercase or lowercase).

Example usage:

>>> from regexes import is_hex_color
>>> is_hex_color("#639")
True
>>> is_hex_color("#6349")
False
>>> is_hex_color("#63459")
False
>>> is_hex_color("#634569")
True
>>> is_hex_color("#663399")
True
>>> is_hex_color("#000000")
True
>>> is_hex_color("#00")
False
>>> is_hex_color("#FFffFF")
True
>>> is_hex_color("#decaff")
True
>>> is_hex_color("#decafz")
False