Collections

You can make your own data structures in Python and there are libraries and code snippets you can find online that do just that. Python’s standard library actually comes with some more data structures in it as well, in the collections module.

Let’s take a look at some of the useful dictionary-like types in the collections module.

NamedTuple

The NamedTuple is not part of the collections module, but in the typing module; added in version 3.5. However, it is related in that it is a data structure for a collection of objects. Sometimes you want to have a collection of data values that can be stored as tuples, but would like to be able to reference the members of the tuple by name instead of index.

Let’s make a function that splits a name based on the first space character:

>>> def split_name(name):
...     first, last = name.rsplit(" ", 1)
...     return first, last
...

If someone calls our function, they’ll have a tuple of two values. They could probably guess what the values are, but what if they weren’t sure?

>>> split_name("Trey Hunner")
('Trey', 'Hunner')

We can use NamedTuple to make the returned value more informative:

>>> from typing import NamedTuple
>>> class Name(NamedTuple):
...     """Class to store first and last names."""
...     first: str
...     last: str
...
>>> def split_name(name):
...     first, last = name.rsplit(" ", 1)
...     return Name(first, last)
...
>>> split_name("Trey Hunner")
Name(first='Trey', last='Hunner')

We created a new class that inherits from NamedTuple in the typing standard module. One of the big advantages of using NamedTuple is that you specify type-hinting of the members for well-documented code. We have told the NamedTuple that the members of our class are both strings. If you are using a static type checker like mypy, it can analyze the code to enforce the data types.

We can use this new value just like a tuple (we can unpack it) but we can also use it like an object with attributes:

>>> name = split_name("Trey Hunner")
>>> first_name, last_name = name
>>> first_name
'Trey'
>>> last_name
'Hunner'
>>> name.first
'Trey'
>>> name.last
'Hunner'

With NamedTuple, one of the things you get for free is a nice __repr__:

>>> name
Name(first='Trey', last='Hunner')

Whenever you need to wrap some values together, think of NamedTuple:

>>> class Location(NamedTuple):
...     name: str
...     lat: float
...     long: float
...
>>>
>>> location = Location('San Diego', 32.733999, -117.147777)
>>> location
Location(name='San Diego', lat=32.733999, long=-117.147777)

Note

The collections module has a namedtuple, which is similar to this, and you may see it in existing code. The NamedTuple is like a supercharged namedtuple.

defaultdict

A defaultdict is a dictionary that uses a constructor to create a default value for each key. This means that when you access a missing key, it’s added automatically.

Let’s take a list of animals:

>>> from collections import defaultdict
>>> animals = [("agatha", "dog"), ("kurt", "cat"), ("margaret", "mouse"), ("cory", "cat"), ("mary", "mouse")]
>>> animal_counts = defaultdict(int)
>>> for name, animal_type in animals:
...     animal_counts[animal_type] += 1
...
>>> animal_counts
defaultdict(<class 'int'>, {'dog': 1, 'mouse': 2, 'cat': 2})
>>> animal_counts['dog']
1
>>> animal_counts['cat']
2

Accessing a key makes it exist:

>>> 'bird' in animal_counts
False
>>> animal_counts['bird']
0
>>> animal_counts
defaultdict(<class 'int'>, {'dog': 1, 'bird': 0, 'mouse': 2, 'cat': 2})
>>> 'bird' in animal_counts
True

A defaultdict has all the same methods that are in a standard dict:

>>> animal_counts.pop("bird")
0
>>> animal_counts
defaultdict(<class 'int'>, {'mouse': 2, 'cat': 2, 'dog': 1})
>>> animal_counts.keys()
dict_keys(['dog', 'mouse', 'cat'])

Let’s see what happens when we use list as our constructor for a defaultdict:

>>> things = defaultdict(list)
>>> things['red']
[]
>>> things['red'].append("ball")
>>> things['red']
['ball']
>>> things['red'].append("ball")
>>> things['red']
['ball', 'ball']
>>> things
defaultdict(<class 'list'>, {'red': ['ball', 'ball']})
>>> things['blue'].append("shoe")
>>> things
defaultdict(<class 'list'>, {'blue': ['shoe'], 'red': ['ball', 'ball']})

What if we want to group the animals by type? For example, we could make a dictionary that has animal types as keys and a list of animal names as values.

We could make a defaultdict with the default value being a new set. We’re using a set because the order of these animal names doesn’t matter and we don’t have any animals with the same names.

>>> animals
[('agatha', 'dog'), ('kurt', 'cat'), ('margaret', 'mouse'), ('cory', 'cat'), ('mary', 'mouse')]
>>> animals_by_type = defaultdict(set)
>>> for name, animal_type in animals:
...     animals_by_type[animal_type].add(name)
...
>>> animals_by_type
defaultdict(<class 'set'>, {'dog': {'agatha'}, 'mouse': {'margaret', 'mary'}, 'cat': {'kurt', 'cory'}})
>>> animals_by_type['cat']
{'kurt', 'cory'}

Whenever you see code that checks if a key is not in a dictionary and puts it in, consider using a defaultdict instead.

For example, let’s count the number of occurrences of each word starting with “P” in the poem Peter Piper:

>>> poem = """Peter Piper picked a peck of pickled peppers.
... A peck of pickled peppers Peter Piper picked.
... If Peter Piper picked a peck of pickled peppers,
... Where's the peck of pickled peppers that Peter Piper picked?"""

You can copy-paste the poem from here:

Peter Piper picked a peck of pickled peppers.
A peck of pickled peppers Peter Piper picked.
If Peter Piper picked a peck of pickled peppers,
Where's the peck of pickled peppers that Peter Piper picked?

We could solve this with an if statement checking whether each word is in the dict yet:

>>> p_words = {}
>>> for word in poem.split():
...     if "p" in word.lower():
...         if word not in p_words:
...             p_words[word] = 0
...         p_words[word] += 1
...
>>> p_words
{'Piper': 4, 'peck': 4, 'picked': 2, 'peppers.': 1, 'peppers,': 1, 'peppers': 2, 'pickled': 4, 'picked?': 1, 'picked.': 1, 'Peter': 4}

We could do this with a defaultdict instead:

>>> p_words = defaultdict(int)
>>> for word in poem.split():
...     if "p" in word.lower():
...         p_words[word] += 1
...
>>> p_words
defaultdict(<class 'int'>, {'Piper': 4, 'peck': 4, 'picked': 2, 'peppers.': 1, 'peppers,': 1, 'peppers': 2, 'pickled': 4, 'picked?': 1, 'picked.': 1, 'Peter': 4})

The defaultdict constructor function does not need to return the same default every time. It could be random:

>>> import random
>>> from collections import defaultdict
>>> default_colors = ['red', 'green', 'blue', 'purple']
>>> random.choice(default_colors)
'red'
>>> random.choice(default_colors)
'blue'
>>> avatar_colors = defaultdict(lambda: random.choice(default_colors))
>>> avatar_colors["trey"] = "red"
>>> avatar_colors
defaultdict(<function <lambda> at 0x7f912cd816a8>, {'trey': 'red'})
>>> avatar_colors["peter"]
'red'
>>> avatar_colors
defaultdict(<function <lambda> at 0x7f912cd816a8>, {'trey': 'red', 'peter': 'red'})
>>> avatar_colors["diane"]
'green'
>>> avatar_colors
defaultdict(<function <lambda> at 0x7f912cd816a8>, {'trey': 'red', 'peter': 'red', 'diane': 'green'})
>>> avatar_colors["diane"] = "purple"
>>> avatar_colors
defaultdict(<function <lambda> at 0x7f912cd816a8>, {'trey': 'red', 'peter': 'red', 'diane': 'purple'})

Counter

Counters are a sort special purpose dictionary used for keeping track of how many times things occur.

>>> from collections import Counter
>>> coins = ["quarter", "dime", "quarter", "penny", "nickel", "dime", "penny", "quarter"]
>>> coin_counts = Counter(coins)
>>> coin_counts
Counter({'quarter': 3, 'penny': 2, 'dime': 2, 'nickel': 1})

You can use any iterable with a Counter.

>>> letters = Counter("hello world")
>>> letters
Counter({'l': 3, 'o': 2, 'r': 1, 'w': 1, ' ': 1, 'e': 1, 'h': 1, 'd': 1})

You can ask for the most common occurrences also.

>>> letters.most_common(3)
[('l', 3), ('o', 2), ('r', 1)]
>>> coin_counts.most_common(1)
[('quarter', 3)]
>>> coin_counts.most_common()
[('quarter', 3), ('penny', 2), ('dime', 2), ('nickel', 1)]

You can also ask a Counter to give you all elements the number of times they occur, output into an iterable:

>>> coin_counts.elements()
<itertools.chain object at 0x7f912cd759e8>
>>> list(coin_counts.elements())
['quarter', 'quarter', 'quarter', 'nickel', 'penny', 'penny', 'dime', 'dime']

Of course these are not ordered in a predictable way:

>>> "".join(letters.elements())
'rwoo ehllld'

We can change values in a Counter:

>>> coin_counts
Counter({'quarter': 3, 'penny': 2, 'dime': 2, 'nickel': 1})
>>> coin_counts.update(["penny", "penny"])
>>> coin_counts
Counter({'penny': 4, 'quarter': 3, 'dime': 2, 'nickel': 1})
>>> coin_counts.subtract(["nickel", "penny", "penny"])
>>> coin_counts
Counter({'quarter': 3, 'penny': 2, 'dime': 2, 'nickel': 0})

We can also add or change values directly:

>>> coin_counts
Counter({'quarter': 3, 'penny': 2, 'dime': 2, 'nickel': 0})
>>> coin_counts['nickel'] = 1
>>> coin_counts
Counter({'quarter': 3, 'penny': 2, 'dime': 2, 'nickel': 1})
>>> coin_counts["toonie"] = 2
>>> coin_counts
Counter({'quarter': 3, 'penny': 2, 'dime': 2, 'toonie': 2, 'nickel': 1})

Like a defaultdict, counters default item values to zero:

>>> letters['z']
0
>>> letters['z'] += 10
>>> letters.most_common(1)
[('z', 10)]
>>> letters['z']
10

Remember that example we used earlier of counting all words with the letter “P” in the Peter Piper poem? That’s even easier with a Counter.

First let’s make the poem:

>>> poem = """Peter Piper picked a peck of pickled peppers.
... A peck of pickled peppers Peter Piper picked.
... If Peter Piper picked a peck of pickled peppers,
... Where's the peck of pickled peppers that Peter Piper picked?"""

Now we can find the counts of all P words with just one line of code:

>>> p_words = Counter(w for w in poem.split() if "p" in w.lower())
>>> p_words
Counter({'Piper': 4, 'peck': 4, 'pickled': 4, 'Peter': 4, 'picked': 2, 'peppers': 2, 'peppers.': 1, 'peppers,': 1, 'picked?': 1, 'picked.': 1})

Other collections

A few other interesting data structures in the collections module:

Data Classes

In version 3.7, a new standard library module was introduced, dataclasses, that is basically a factory code generator to make classes without having to do your own boilerplate things like __repr__, comparisons, and things like that. It is (superficially) a bit similar to the way NamedTuple works, but is more powerful and very customizable.

It uses the @dataclass decorator to make a class into a powerful dataclass. A simple example is very similar to NamedTuple we already saw:

>>> from dataclasses import dataclass
>>>
>>> @dataclass
... class Location:
...     name: str
...     lat: float
...     long: float
...
>>> location = Location('San Diego', 32.733999, -117.147777)
>>> location
Location(name='San Diego', lat=32.733999, long=-117.147777)

If you like the idea of named tuples, but you want more flexibility (something that helps you generate a class but doesn’t make an immutable tuple), definitely look into dataclasses.

Here’s a class that represents three dimensional coordinates:

from dataclasses import astuple, dataclass


@dataclass
class Point:
    """Three-dimensional point."""
    x: float
    y: float
    z: float
    def __iter__(self):
        yield from astuple(self)
>>> p1 = Point(1, 2, 3)
>>> p1
Point(x=1, y=2, z=3)
>>> p2 = Point(4, 5, 6)

This is equivalent to:

class Point:
    """Three-dimensional point."""
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z
    def __repr__(self):
        """Return dev-readable representation of Point."""
        return f"Point(x={self.x}, y={self.y}, z={self.z})"
    def __eq__(self, other):
        """Return True if our point is equal to the other point."""
        return (self.x, self.y, self.z) == (other.x, other.y, other.z)
    def __iter__(self):
        yield from (self.x, self.y, self.z)

If you want to make these Point objects immutable, you can pass a frozen attribute to the dataclass decorator:

from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class Point:
    """Three-dimensional point."""
    x: float
    y: float
    z: float
    def __iter__(self):
        yield from astuple(self)

Now our point is immutable:

>>> p = Point(1, 2, 3)
>>> p.x = 4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 3, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'x'

The new dataclasses library is essentially a more stripped-down version of the third-party attrs library. If you want more functionality, look into attrs as well.

Collection Exercises

Count Words

This is the count_words exercise in dictionaries.py.

Modify the function count_words so that it accepts a string and returns a dictionary noting the number of times each word occurs in the string.

As a bonus exercise, ignore case and remove commas and other symbols between words.

Hint

Consider using a Counter or defaultdict from the collections module.

Note

To run tests for the Bonus exercise, open dictionaries_test.py, find the CountWordsTests class and comment out the @unittest.skip lines from the appropriate test methods.

Maximums

This is the MaxCounter exercise in classes.py.

Modify the class MaxCounter so that it acts just like the Counter class but MaxCounter objects should have a max_keys method that provides an iterable of all the keys that occur the most number of times.

For example:

>>> from classes import MaxCounter
>>> counts = MaxCounter('string of various words')
>>> counts
MaxCounter({'s': 3, 'r': 3, ' ': 3, 'o': 3, 'i': 2, 't': 1, 'n': 1, 'g': 1, 'f': 1, 'v': 1, 'a': 1, 'u': 1, 'w': 1, 'd': 1})
>>> counts.max_keys()
['s', 'r', ' ', 'o']

Ice Cream DataClasses

Rewrite the Size, Flavor, and IceCream classes in classes.py using data classes.

You can test each of these classes by running tests for Size, Flavor, and IceCream.

Most common

This is the get_most_common exercise in sets.py.

Create a function that accepts a list of iterables and returns a set of the most common items from all of the given iterables.

>>> get_most_common([{1, 2}, {2, 3}, {3, 4}])
{2, 3}

Hint

Consider using a Counter or defaultdict from the collections module.

Flipping Dictionary of Lists

This is the flip_dict_of_lists exercise in collection.py.

Write a function that takes a dictionary of lists and returns a new dictionary containing the list items as keys and the original dictionary keys as list values.

Example:

>>> restaurants_by_people = {
...     'diane': {'Siam Nara', 'Punjabi Tandoor', 'Opera'},
...     'peter': {'Karl Strauss', 'Opera', 'Habaneros'},
...     'trey': {'Habaneros', 'Karl Strauss', 'Opera', 'Punjabi Tandoor'},
... }
>>> favorite_restaurants = flip_dict_of_lists(restaurants_by_people)
>>> favorite_restaurants
{'Siam Nara': ['diane'], 'Karl Strauss': ['trey', 'peter'], 'Opera': ['diane', 'trey', 'peter'], 'Punjabi Tandoor': ['diane', 'trey'], 'Habaneros': ['trey', 'peter']}

Hint

Consider using a defaultdict from the collections module.

Deal Cards

These are the get_cards, shuffle_cards and deal_cards exercises in collection.py.

Create three functions:

  • get_cards: returns a list of namedtuples representing cards. Each card should have suit and rank.

  • shuffle_cards: accepts a list of cards as its argument and shuffles the list of cards in-place

  • deal_cards: accepts a number as its argument, removes the given number of cards from the end of the list and returns them

Examples:

>>> from collection import get_cards, shuffle_cards, deal_cards
>>> deck = get_cards()
>>> deck[:14]
[Card(rank='A', suit='spades'), Card(rank='2', suit='spades'), Card(rank='3', suit='spades'), Card(rank='4', suit='spades'), Card(rank='5', suit='spades'), Card(rank='6', suit='spades'), Card(rank='7', suit='spades'), Card(rank='8', suit='spades'), Card(rank='9', suit='spades'), Card(rank='10', suit='spades'), Card(rank='J', suit='spades'), Card(rank='Q', suit='spades'), Card(rank='K', suit='spades'), Card(rank='A', suit='hearts')]
>>> len(deck)
52
>>> shuffle_cards(deck)
>>> deck[-5:]
[Card(rank='9', suit='diamonds'), Card(rank='6', suit='hearts'), Card(rank='7', suit='diamonds'), Card(rank='K', suit='spades'), Card(rank='7', suit='clubs')]
>>> hand = deal_cards(deck)
>>> hand
[Card(rank='9', suit='diamonds'), Card(rank='6', suit='hearts'), Card(rank='7', suit='diamonds'), Card(rank='K', suit='spades'), Card(rank='7', suit='clubs')]
>>> len(deck)
47
>>> deck[-5:]
[Card(rank='5', suit='spades'), Card(rank='Q', suit='clubs'), Card(rank='Q', suit='spades'), Card(rank='2', suit='diamonds'), Card(rank='6', suit='clubs')]

Memory-efficient CSV

This is the parse_csv exercise in collection.py.

Using DictReader to read CSV files is convenient because CSV columns can be referenced by name (instead of positional order). However there are some downsides to using DictReader. CSV column ordering is lost because dictionaries are unordered. The space required to store each row is also unnecessarily large because dictionaries are not a very space-efficient data structure.

There is discussion of adding a NamedTupleReader to the csv module, but this hasn’t actually happened yet.

In the meantime, it’s not too difficult to use a csv.reader object to open a CSV file and then use a namedtuple to represent each row. See the collections.namedtuple for information about the original namedtuple.

Create a function parse_csv that accepts a file object which contains a CSV file (including a header row) and returns a list of namedtuples representing each row.

Example with us-state-capitals.csv:

>>> with open('us-state-capitals.csv') as csv_file:
...     csv_rows = parse_csv(csv_file)
...
>>> csv_rows[:3]
[Row(state='Alabama', capital='Montgomery'), Row(state='Alaska', capital='Juneau'), Row(state='Arizona', capital='Phoenix')]