Statistics

We start with a module containing a few basic statistics helper functions:

def mean(x):
    t = 0
    for n in x:
        t += n
    meanavg = t / len(x)
    return meanavg


def median(x):
    if len(x)%2 != 0:
        return sorted(x)[int(len(x)/2)]
    else:
        midavg = (sorted(x)[int(len(x)/2)] + sorted(x)[int(len(x)/2-1)])/2.0
        return midavg


def mode(x):
    o = {}
    for n in x:
        if n not in o:
            o[n] = 0
        o[n] += 1
    m = max(o.values())
    modeavg = [k for (k, v) in o.items() if v == m][0]
    return modeavg


if __name__ == "__main__":
    values = [5, 8, 2, 7, 2, 60]
    assert mean(values) == 14
    assert median(values) == 6
    assert mode(values) == 2
    print("Tests passed.")

Let’s start by refactoring mean:

  1. Rename x to numbers and t to total so the names are descriptive: PEP 8 and “Readability counts.”

  2. Remove the unnecessary meanavg variable, which is returned immediately: “Simple is better than complex.”

def mean(numbers):
    total = 0
    for n in numbers:
        total += n
    return total / len(numbers)

We can simplify this further:

  1. Use sum instead of a for loop to calculate total: “Simple is better than complex.”

  2. Remove unnecessary total variable
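
For reference, the intermediate form after step 1 alone, before the total variable is dropped, might look like the sketch below. It is only an illustration of the in-between state, not a version you would keep.

```python
def mean(numbers):
    # Step 1 only: sum() replaces the explicit loop,
    # but the temporary total variable is still here.
    total = sum(numbers)
    return total / len(numbers)
```

Returning the expression directly then gives the final version that follows.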

def mean(numbers):
    return sum(numbers) / len(numbers)

Now let’s work on median:

  1. Use more descriptive variable name (numbers): PEP 8 and “Readability counts.”

  2. Reduce inefficiency and repetition by only sorting numbers once: “Simple is better than complex.”

  3. Reduce repetition by storing the length in a length variable

  4. Remove unnecessary variable before return

def median(numbers):
    numbers = sorted(numbers)
    length = len(numbers)
    if length%2 != 0:
        return numbers[int(length/2)]
    else:
        return (numbers[int(length/2)] + numbers[int(length/2-1)])/2.0

A few more cleanups:

  1. Add spaces around the % operator for readability: PEP 8

  2. Add mid_point variable to reduce repetition

  3. Add spaces around the / operator (PEP 8) and simplify 2.0 to 2, since / is already true division in Python 3

def median(numbers):
    numbers = sorted(numbers)
    length = len(numbers)
    mid_point = int(length/2)
    if length % 2 != 0:
        return numbers[mid_point]
    else:
        return (numbers[mid_point] + numbers[mid_point - 1]) / 2
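
The test block in the original module only exercises an even-length list, so a quick check of both branches is worthwhile after a refactor like this. The sketch below is one possible variant rather than the version above: it assumes floor division (length // 2) in place of int(length/2), which gives the same index for any realistic list length, and the odd-length sample list is made up for illustration.

```python
def median(numbers):
    numbers = sorted(numbers)
    length = len(numbers)
    # Assumption: floor division instead of int(length/2); same index, no float round trip.
    mid_point = length // 2
    if length % 2 != 0:
        return numbers[mid_point]
    else:
        return (numbers[mid_point] + numbers[mid_point - 1]) / 2


# Even length: average of the two middle values (5 and 7).
assert median([5, 8, 2, 7, 2, 60]) == 6
# Odd length (illustrative data): the single middle value.
assert median([5, 8, 2, 7, 2]) == 5
```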

Let’s refactor mode:

  1. Rename x to numbers, o to occurrences, and m to most

  2. Remove unnecessary modeavg variable

  3. Replace if statement with setdefault

  4. Consider replacing setdefault with get
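
The setdefault form from step 3 is not shown, so here is a minimal sketch of what that intermediate version might look like before get replaces it:

```python
def mode(numbers):
    occurrences = {}
    for n in numbers:
        # setdefault stores 0 the first time a key is seen and leaves it alone afterwards.
        occurrences.setdefault(n, 0)
        occurrences[n] += 1
    most = max(occurrences.values())
    return [k for (k, v) in occurrences.items() if v == most][0]
```

With get, the default lookup and the increment collapse into a single assignment, as in the version that follows.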

def mode(numbers):
    occurrences = {}
    for n in numbers:
        occurrences[n] = occurrences.get(n, 0) + 1
    most = max(occurrences.values())
    return [k for (k, v) in occurrences.items() if v == most][0]

Next, let the standard library do the counting:

  1. Use defaultdict instead

  2. Switch from defaultdict to single-line Counter statement
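
The defaultdict step is skipped over in the code, so as a rough sketch, the intermediate version from step 1 could look like this:

```python
from collections import defaultdict


def mode(numbers):
    # int() returns 0, so a missing key starts its count at zero automatically.
    occurrences = defaultdict(int)
    for n in numbers:
        occurrences[n] += 1
    most = max(occurrences.values())
    return [k for (k, v) in occurrences.items() if v == most][0]
```

Counter goes one step further and removes the counting loop entirely, which is the version that follows.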

from collections import Counter

def mode(numbers):
    occurrences = Counter(numbers)
    most = max(occurrences.values())
    return [k for (k, v) in occurrences.items() if v == most][0]

Finally:

  1. Use most_common method to get the most commonly occurring number

  2. Consider whether to squash two lines into one return

def mode(numbers):
    value, _ = Counter(numbers).most_common(1)[0]
    return value
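
For comparison, the single-return alternative weighed in step 2 might be written as in the sketch below. It behaves the same way, but the double [0][0] indexing is arguably harder to read than unpacking into a named value, which is presumably why the two-line form is a reasonable choice.

```python
from collections import Counter


def mode(numbers):
    # most_common(1) returns a one-item list of (value, count) pairs;
    # the first [0] picks the pair, the second [0] picks the value.
    return Counter(numbers).most_common(1)[0][0]
```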

None of these is actually the most Pythonic implementation of mean, median, and mode. The most Pythonic option is to use the versions that ship in the standard library’s statistics module (available since Python 3.4):

>>> from statistics import mean, median, mode
>>> numbers = [5, 8, 2, 7, 2, 60]
>>> mean(numbers)
14
>>> median(numbers)
6.0
>>> mode(numbers)
2
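
As a sanity check, the refactored helpers can be compared against the standard library on the same sample data. The sketch below repeats the final hand-written versions and also shows one behavioral difference: on an empty list the hand-written mean raises ZeroDivisionError, while statistics.mean raises StatisticsError. The error_name helper is hypothetical and exists only for this illustration.

```python
import statistics
from collections import Counter


def mean(numbers):
    return sum(numbers) / len(numbers)


def median(numbers):
    numbers = sorted(numbers)
    length = len(numbers)
    mid_point = int(length/2)
    if length % 2 != 0:
        return numbers[mid_point]
    else:
        return (numbers[mid_point] + numbers[mid_point - 1]) / 2


def mode(numbers):
    value, _ = Counter(numbers).most_common(1)[0]
    return value


numbers = [5, 8, 2, 7, 2, 60]

# The hand-written helpers and the standard library agree on the sample data.
assert mean(numbers) == statistics.mean(numbers)
assert median(numbers) == statistics.median(numbers)
assert mode(numbers) == statistics.mode(numbers)


def error_name(func, data):
    """Hypothetical helper: name of the exception func(data) raises, or None."""
    try:
        func(data)
    except Exception as exc:
        return type(exc).__name__
    return None


print(error_name(mean, []))             # ZeroDivisionError
print(error_name(statistics.mean, []))  # StatisticsError
```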