Regular Expressions

Raw Strings

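In Python string literals, a backslash starts an escape sequence. In the example below, \n is read as a newline character rather than as a backslash followed by the letter n: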

>>> file_name = "C:\projects\nathan"
>>> file_name
'C:\\projects\nathan'
>>> print(file_name)
C:\projects
athan

If we want to use a literal backslash, we need to escape it by putting two backslashes instead:

>>> file_name = "C:\\projects\\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan

We can turn off character escaping completely by using “raw” strings, which you can make by prefixing your string with an r character:

>>> file_name = r"C:\projects\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan
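
As a quick check, the sketch below compares the spellings used above. The variable names are only for illustration, and the third string doubles its first backslash so that only the \n escape is in play.

```python
# Three ways of writing the path from the examples above.
escaped = "C:\\projects\\nathan"   # each backslash doubled
raw = r"C:\projects\nathan"        # raw string: backslashes are taken literally
newline = "C:\\projects\nathan"    # here \n is still a newline character

print(escaped == raw)    # True: both spellings produce the same string
print(len(raw))          # 18
print("\n" in newline)   # True: this one contains a newline
```

The raw and escaped spellings are the same string once Python has read them; only the way they are typed differs.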

findall

The findall function works similarly to search except that instead of a match object, it returns a list of every match in a string.

>>> match = re.findall(r'\w*', greeting)
>>> match
['hello', '', 'world', '']

Note that the space and end of string result in empty strings in this list because we’re searching for zero or more word characters, and they do match zero word characters.

The + command matches the previous command one or more times. This removes the empty strings, since neither the space nor the end of the string contains one or more word characters.

>>> re.findall(r'\w+', greeting)
['hello', 'world']
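
The relationship between the two results can be checked directly. This is a minimal sketch that assumes greeting holds "hello world", as the outputs above suggest.

```python
import re

# Assumption: greeting holds the same text as in the examples above.
greeting = "hello world"

zero_or_more = re.findall(r'\w*', greeting)
one_or_more = re.findall(r'\w+', greeting)

# Dropping the empty strings from the * result leaves the + result.
non_empty = [m for m in zero_or_more if m]
print(non_empty == one_or_more)
```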

More regex syntax

The . command will match any character except a newline:

>>> re.findall(r'o.', greeting)
['o ', 'or']
>>> re.findall(r'.l.', greeting)
['ell', 'rld']

If we want to match a literal period character we will need to use a backslash to escape the period:

>>> re.findall(r'\.', "hi.")
['.']

We can use square brackets to match any one character from a set of characters:

>>> re.findall(r'[aeiou]', greeting)
['e', 'o', 'o']

We can also use a dash to match characters within ranges of characters:

>>> re.findall(r'[a-z]', "Hi there")
['i', 't', 'h', 'e', 'r', 'e']
>>> re.findall(r'[A-Za-z]', "Hi there")
['H', 'i', 't', 'h', 'e', 'r', 'e']
>>> re.findall(r'[0-9]', "Hi there")
[]

The caret character can be used at the beginning of a character set to denote negation. This will match anything except for lowercase letters:

>>> re.findall(r'[^a-z]', "Hi there")
['H', ' ']
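
Ranges, single characters, and negation can be combined in one set. The sketch below reuses the "Hi there" string from above; the variable names are only for illustration.

```python
import re

text = "Hi there"

# A set can mix several ranges and single characters:
# here letters, digits, or an underscore.
word_like = re.findall(r'[A-Za-z0-9_]', text)
print(word_like)

# A negated set can list single characters too:
# everything except lowercase vowels and the space character.
not_vowel_or_space = re.findall(r'[^aeiou ]', text)
print(not_vowel_or_space)
```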

split

Let’s split a string by words:

>>> sentence = "Oh what a day, what a lovely day!"
>>> from collections import Counter
>>> sentence.split()
['Oh', 'what', 'a', 'day,', 'what', 'a', 'lovely', 'day!']
>>> Counter(sentence.split())
Counter({'what': 2, 'a': 2, 'Oh': 1, 'day,': 1, 'lovely': 1, 'day!': 1})

The split method on strings splits based on whitespace characters.

With the split function in the re module, we can split a string by a regular expression.

Let’s use \W+ to split based on one or more “non-word” characters:

>>> re.split(r'\W+', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day', '']
>>> Counter(re.split(r'\W+', sentence))
Counter({'what': 2, 'a': 2, 'day': 2, 'Oh': 1, 'lovely': 1, '': 1})
>>> Counter(filter(None, re.split(r'\W+', sentence)))
Counter({'what': 2, 'a': 2, 'day': 2, 'Oh': 1, 'lovely': 1})

We could accomplish nearly the same thing by using findall to find all words. Note the raw strings, which keep Python from treating \w as a string escape:

>>> re.findall(r'\w+', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day']
>>> Counter(re.findall(r'\w+', sentence))
Counter({'what': 2, 'a': 2, 'day': 2, 'Oh': 1, 'lovely': 1})
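
The findall approach can be wrapped in a small function. This is only a sketch; count_words is a hypothetical helper name, not something defined elsewhere in these notes.

```python
import re
from collections import Counter

def count_words(text):
    """Count each run of word characters in text (hypothetical helper)."""
    return Counter(re.findall(r'\w+', text))

counts = count_words("Oh what a day, what a lovely day!")
print(counts["day"])    # 2
print(counts["what"])   # 2
```

Because findall only returns non-empty matches here, there is no empty string to filter out as there was with split.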

match

A caret can be put at the beginning of a regular expression to denote that it should only match at the beginning of the target string. Similarly, a dollar sign can be put at the end of a regular expression to denote that it should only match at the end of the target string.

>>> re.search(r'^wo', "hello world")
>>> re.search(r'^wo', "world")
<re.Match object; span=(0, 2), match='wo'>
>>> re.search(r'^wo$', "world")
>>> re.search(r'^wo$', "wo")
<re.Match object; span=(0, 2), match='wo'>

The match function is the same as using search with a caret. So match requires the pattern to match at the beginning of the string.

>>> re.match('what', sentence)
>>> re.match('.*what', sentence)
<re.Match object; span=(0, 19), match='Oh what a day, what'>
>>> re.match('Oh', sentence)
<re.Match object; span=(0, 2), match='Oh'>

The fullmatch function is the same as using search with both a caret and a dollar sign. So fullmatch requires the pattern to match the entire string.

>>> re.fullmatch(r'what', sentence)
>>> re.fullmatch(r'.*what', sentence)
>>> re.fullmatch(r'.*what.*', sentence)
<re.Match object; span=(0, 33), match='Oh what a day, what a lovely day!'>
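
To see the three functions side by side, here is a short sketch using the same sentence. The variable names are only for illustration.

```python
import re

sentence = "Oh what a day, what a lovely day!"

# match behaves like search with the pattern anchored at the start.
no_match = re.match(r'what', sentence)       # None: "what" is not at the start
via_search = re.search(r'^Oh', sentence)
via_match = re.match(r'Oh', sentence)
print(no_match is None)
print(via_search.span() == via_match.span())

# fullmatch behaves like search anchored at both ends.
via_anchors = re.search(r'^Oh.*day!$', sentence)
via_fullmatch = re.fullmatch(r'Oh.*day!', sentence)
print(via_anchors.span() == via_fullmatch.span())
```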

sub

You can use the sub function to replace parts of a string.

Let’s replace all vowels with the letter “x”:

>>> re.sub(r'[aeiou]', r"x", greeting)
'hxllx wxrld'

We can use parentheses to group parts of regular expressions. Groups can be referenced by their number in the replacement string:

>>> re.sub(r'([aeiou])', r"x\1", greeting)
'hxellxo wxorld'

We can use \d to match a digit character and \D to match a non-digit character. Curly braces with a number inside repeat the last command exactly that many times, so \d{3} matches three digits in a row.

Let’s make a regular expression to normalize phone numbers:

>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "(202) 456-1111")
'202-456-1111'
>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "202 - 456 - 1111")
'202-456-1111'
>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "202.456.1111")
'202-456-1111'
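
The same substitution can be wrapped in a function. This is a minimal sketch; normalize_phone is a hypothetical helper name. It assumes that input without ten digits should be left alone, which is what sub does when the pattern finds no match.

```python
import re

def normalize_phone(number):
    """Return number as NNN-NNN-NNNN (hypothetical helper name)."""
    return re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r'\1-\2-\3', number)

for text in ["(202) 456-1111", "202 - 456 - 1111", "202.456.1111"]:
    print(normalize_phone(text))   # 202-456-1111 each time

# Only seven digits: the pattern does not match, so the input comes back unchanged.
print(normalize_phone("456-1111"))
```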

compile

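If we need to use the same regular expression multiple times throughout our program, we can compile it once with re.compile and reuse the resulting pattern object. The re module already caches recently used patterns, so the speedup is usually small, but a compiled pattern gives the expression a name and keeps it in one place.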

>>> phone_re = re.compile(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*')
>>> phone_re.sub(r"(\1) \2 - \3", "202.456.1111")
'(202) 456 - 1111'
>>> phone_re.sub(r"\1-\2-\3", "202.456.1111")
'202-456-1111'
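
A compiled pattern offers the other functions we have seen as methods too, not just sub. The sketch below uses an example sentence made up for illustration.

```python
import re

phone_re = re.compile(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*')

# search is available as a method on the compiled pattern.
m = phone_re.search("Call 202.456.1111 today")
parts = (m.group(1), m.group(2), m.group(3))
print(parts)

# The original pattern text is kept on the pattern object.
print(phone_re.pattern)
```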

Grouping

So far our regular expressions have consisted solely of commands that match individual characters or allow repetition of individual character matches.

What if we want to act on a group?

For example, what if we want to match US ZIP codes in their shortened form or their full form?

Here’s a regular expression for matching shortened ZIP codes:

>>> re.search(r'^\d{5}$', '90210')
<re.Match object; span=(0, 5), match='90210'>

A full ZIP code match looks like this:

>>> re.search(r'^\d{5}-\d{4}$', '90210-4873')
<re.Match object; span=(0, 10), match='90210-4873'>

So far we haven’t seen a way to make that last part optional.

We could try putting a question mark after the - and the repetition:

>>> re.search(r'^\d{5}-?\d{4}?$', '90210-4873')
<re.Match object; span=(0, 10), match='90210-4873'>
>>> re.search(r'^\d{5}-?\d{4}?$', '902104873')
<re.Match object; span=(0, 9), match='902104873'>
>>> re.search(r'^\d{5}-?\d{4}?$', '90210-')

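That matches strange things though, such as nine digits with no dash. The ? after the repetition count does not make the four digits optional either: following a repetition, ? makes it non-greedy, so \d{4}? still requires exactly four digits and a plain 5-digit ZIP code would not match.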
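
A short check of how this pattern treats a bare five-digit code, written as a sketch with illustrative variable names:

```python
import re

pattern = r'^\d{5}-?\d{4}?$'

# The ? after {4} does not make the four digits optional, so a bare
# five-digit code is rejected while a nine-digit run is accepted.
short_form = re.search(pattern, '90210')
nine_digits = re.search(pattern, '902104873')
print(short_form)    # None
print(nine_digits is not None)
```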

To optionally match a number of consecutive character patterns, we can use a group:

>>> re.search(r'^\d{5}(-\d{4})?$', '90210-4873')
<re.Match object; span=(0, 10), match='90210-4873'>
>>> re.search(r'^\d{5}(-\d{4})?$', '90210')
<re.Match object; span=(0, 5), match='90210'>
>>> re.search(r'^\d{5}(-\d{4})?$', '902104873')
>>> re.search(r'^\d{5}(-\d{4})?$', '90210-')

This allows us to match 5 digits optionally followed by a dash and 4 digits. The dash and the 4 digits must either both be present or both be absent.
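
The grouped pattern can be turned into a simple validation function. This is only a sketch; is_zip_code and ZIP_RE are hypothetical names chosen for illustration.

```python
import re

ZIP_RE = re.compile(r'^\d{5}(-\d{4})?$')

def is_zip_code(text):
    """Return True if text is a 5-digit or 5+4 ZIP code (hypothetical helper)."""
    return ZIP_RE.search(text) is not None

for candidate in ['90210', '90210-4873', '902104873', '90210-']:
    print(candidate, is_zip_code(candidate))
```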

Capture Groups

We’ve already talked about using groups to allow for quantifying a group of character patterns.

There’s actually another purpose for groups though.

Groups also allow capturing characters matched by a group.

Remember how we used the group method to access the matched data? We can pass arguments to that method to access captured groups.

For example, in our ZIP code regular expression, we can get the first matching group like this:

>>> m = re.search(r'^\d{5}(-\d{4})?$', '90210-4873')
>>> m.group(1)
'-4873'

>>> m.group()
'90210-4873'

If we want to always access just the first 5 digits, we could put those in a group:

>>> m = re.search(r'(^\d{5})(-\d{4})?$', '90210-4873')
>>> m.group(2)
'-4873'
>>> m.group(1)
'90210'

Note that group 0 gives us the entire match, just like when we pass no arguments:

>>> m.group(0)
'90210-4873'
>>> m.group()
'90210-4873'
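
One case worth checking is what happens when an optional group takes no part in the match. The sketch below also uses the groups method of match objects, which has not been shown in the examples above; it returns all captured groups as a tuple.

```python
import re

zip_pattern = r'(^\d{5})(-\d{4})?$'

m = re.search(zip_pattern, '90210')
first = m.group(1)
second = m.group(2)    # the optional group did not take part in the match
everything = m.groups()

print(first)        # 90210
print(second)       # None
print(everything)   # ('90210', None)
```

Code that reads an optional group should therefore be prepared to get None back.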

Regular Expression Exercises

Count Numbers

This is the count_numbers exercise in regexes.py.

Edit the function count_numbers so that it returns a count of each number found in a given string.

Hint

You can match a number by using a regular expression that matches one or more consecutive digits: \d+

>>> from regexes import count_numbers
>>> count_numbers(declaration)
{'4': 1, '1776': 1}
>>> count_numbers("Why was 6 afraid of 7? Because 7 8 9.")
{'7': 2, '9': 1, '6': 1, '8': 1}

Get File Extension

This is the get_extension exercise in regexes.py.

Edit the function get_extension so that it accepts a full file path and returns the file extension.

Example usage:

>>> from regexes import get_extension
>>> get_extension('archive.zip')
'zip'
>>> get_extension('image.jpeg')
'jpeg'
>>> get_extension('index.xhtml')
'xhtml'
>>> get_extension('archive.tar.gz')
'gz'

Normalize JPEG Extension

This is the normalize_jpeg exercise in regexes.py.

Edit the function normalize_jpeg so that it accepts a JPEG filename and returns a new filename with the extension normalized to lowercase jpg (without an e).

Hint

Look up how to pass flags to the re.sub function.

Example usage:

>>> from regexes import normalize_jpeg
>>> normalize_jpeg('avatar.jpeg')
'avatar.jpg'
>>> normalize_jpeg('Avatar.JPEG')
'Avatar.jpg'
>>> normalize_jpeg('AVATAR.Jpg')
'AVATAR.jpg'

Count Punctuation

This is the count_punctuation exercise in regexes.py.

Edit the function count_punctuation so that it takes a string and returns a count of all punctuation characters in the string.

Punctuation characters are characters which are not word characters and are not whitespace characters.

Hint

You can match punctuation characters with this regular expression: [^\s\w]

>>> from regexes import count_punctuation
>>> count_punctuation("^_^ hello there! @_@")
{'^': 2, '@': 2, '!': 1}
>>> count_punctuation(declaration)
{',': 122, '.': 36, ':': 10, ';': 9, '-': 4, '—': 1, '’': 1}

Normalize Whitespace

This is the normalize_whitespace exercise in regexes.py.

Edit the function normalize_whitespace so that it replaces all instances of one or more whitespace characters with a single space.

Example usage:

>>> from regexes import normalize_whitespace
>>> normalize_whitespace("hello  there")
'hello there'
>>> normalize_whitespace("""Hold fast to dreams
... For if dreams die
... Life is a broken-winged bird
... That cannot fly.
...
... Hold fast to dreams
... For when dreams go
... Life is a barren field
... Frozen with snow.""")
'Hold fast to dreams For if dreams die Life is a broken-winged bird That cannot fly. Hold fast to dreams For when dreams go Life is a barren field Frozen with snow.'

Hex Colors

This is the is_hex_color exercise in regexes.py.

Edit the function is_hex_color so that it matches hexadecimal color codes. A hex color code consists of an octothorpe (#) followed by either 3 or 6 hexadecimal digits (that’s 0 to 9 or a to f, in either uppercase or lowercase).

Example usage:

>>> from regexes import is_hex_color
>>> is_hex_color("#639")
True
>>> is_hex_color("#6349")
False
>>> is_hex_color("#63459")
False
>>> is_hex_color("#634569")
True
>>> is_hex_color("#663399")
True
>>> is_hex_color("#000000")
True
>>> is_hex_color("#00")
False
>>> is_hex_color("#FFffFF")
True
>>> is_hex_color("#decaff")
True
>>> is_hex_color("#decafz")
False