CSV Files

CSV Reader

Python’s standard library includes a csv module with utilities for helping us read and write CSV files. Let’s use this to parse us-state-capitals.csv.

>>> import csv
>>> capitals = {}
>>> with open('us-state-capitals.csv', newline='') as csv_file:
...     csv_reader = csv.reader(csv_file)
...     for state, capital in csv_reader:
...         capitals[state] = capital
>>> capitals
{'state': 'capital', 'Alabama': 'Montgomery', 'Alaska': 'Juneau', 'Arizona': 'Phoenix', 'Arkansas': 'Little Rock', 'California': 'Sacramento', 'Colorado': 'Denver', 'Connecticut': 'Hartford', 'Delaware': 'Dover', 'Hawaii': 'Honolulu', 'Florida': 'Tallahassee', 'Georgia': 'Atlanta', 'Idaho': 'Boise', 'Illinois': 'Springfield', 'Indiana': 'Indianapolis', 'Iowa': 'Des Moines', 'Kansas': 'Topeka', 'Kentucky': 'Frankfort', 'Louisiana': 'Baton Rouge', 'Maine': 'Augusta', 'Maryland': 'Annapolis', 'Massachusetts': 'Boston', 'Michigan': 'Lansing', 'Minnesota': 'St. Paul', 'Mississippi': 'Jackson', 'Missouri': 'Jefferson City', 'Montana': 'Helena', 'Nebraska': 'Lincoln', 'Nevada': 'Carson City', 'New Hampshire': 'Concord', 'New Jersey': 'Trenton', 'New Mexico': 'Santa Fe', 'North Carolina': 'Raleigh', 'North Dakota': 'Bismarck', 'New York': 'Albany', 'Ohio': 'Columbus', 'Oklahoma': 'Oklahoma City', 'Oregon': 'Salem', 'Pennsylvania': 'Harrisburg', 'Rhode Island': 'Providence', 'South Carolina': 'Columbia', 'South Dakota': 'Pierre', 'Tennessee': 'Nashville', 'Texas': 'Austin', 'Utah': 'Salt Lake City', 'Vermont': 'Montpelier', 'Virginia': 'Richmond', 'Washington': 'Olympia', 'West Virginia': 'Charleston', 'Wisconsin': 'Madison', 'Wyoming': 'Cheyenne'}
>>> len(capitals)
51
>>> capitals['state']
'capital'

We ended up with 51 state-capital pairs because the first row of the file is the header row state,capital. When there is a header row in the CSV file, we want to grab it first before we read the data.

We can change our code to read the header first, like this:

>>> import csv
>>> capitals = {}
>>> with open('us-state-capitals.csv', newline='') as csv_file:
...     csv_reader = csv.reader(csv_file)
...     headers = next(csv_reader)
...     for state, capital in csv_reader:
...         capitals[state] = capital
>>> headers
['state', 'capital']
>>> len(capitals)
50

That next function skips over the first row of our CSV file.

CSV Writer

The CSV writer will take the data we give it and write a CSV file:

>>> import csv
>>> colors = [("purple", "0.15"), ("indigo", "0.25"), ("red", "0.3")]
>>> with open('colors.csv', mode='wt', newline='') as colors_file:
...     csv_writer = csv.writer(colors_file)
...     for color, ratio in colors:
...         csv_writer.writerow((color, ratio))
13
13
9

If we read from colors.csv, we’ll see our colors were written to it.

>>> with open('colors.csv') as colors_file:
...     print(colors_file.read())
...
purple,0.15
indigo,0.25
red,0.3

Note

Note that we’re passing a newline='' argument to open. This is particularly important when writing CSV files.

The documentation recommends passing newline='' when open a file to be written (or even read) as a CSV file to tell Python not to automatically convert \n characters to \r\n on Windows.

By default when writing to a file on Windows, Python’s file objects converts \n line endings to \r\n. But Python’s csv.writer utility ends each line with \r\n (the Windows-style line ending of carriage return followed by a line feed). So without newline='' our code would work on Linux and Mac but on Windows we would end up doubling-up our carriage returns (\r\r\n), which isn’t correct. Many CSV readers may show an empty line between each line due to these doubled-up carriage returns. So follow the documentation and always use newlin='' when working with CSV files if your Python code might ever run on a Windows machine.

Delimiters and Quote Characters

The CSV module isn’t just for reading comma-delimited files. We can read files with any delimiters and any quote characters.

Comma is the default delimiter and double quote is the default quote character.

Here’s an example of writing a CSV file with a tab delimiter:

>>> import csv
>>> with open('colors.tsv', mode='wt', newline='') as colors_file:
...     writer = csv.writer(colors_file, delimiter='\t')
...     writer.writerows(colors)
...

If we go to the command line and display the file, it looks like this:

$ cat colors.tsv
purple  0.15
indigo  0.25
red 0.3

CSV Headers

So CSV reader gives us a list of lists and CSV writer expects us to pass it something similar.

What if we want to refer to our CSV rows by their column names?

We can use DictReader and DictWriter for that.

>>> import csv
>>> capitals = {}
>>> with open('us-state-capitals.csv', newline='') as csv_file:
...     reader = csv.DictReader(csv_file)
...     rows = list(reader)
>>> rows[:2]
[{'state': 'Alabama', 'capital': 'Montgomery'}, {'state': 'Alaska', 'capital': 'Juneau'}]

The DictReader knows to interpret the first row of the file as headers for the file, and you can reference the header names.

>>> import csv
>>> capitals = {}
>>> with open('us-state-capitals.csv', newline='') as csv_file:
...     reader = csv.DictReader(csv_file)
...     for row in reader:
...         capitals[row['capital']] = row['state']
...
>>> capitals
{'Montgomery': 'Alabama', 'Juneau': 'Alaska', 'Phoenix': 'Arizona', 'Little Rock': 'Arkansas', 'Sacramento': 'California', 'Denver': 'Colorado', 'Hartford': 'Connecticut', 'Dover': 'Delaware', 'Honolulu': 'Hawaii', 'Tallahassee': 'Florida', 'Atlanta': 'Georgia', 'Boise': 'Idaho', 'Springfield': 'Illinois', 'Indianapolis': 'Indiana', 'Des Moines': 'Iowa', 'Topeka': 'Kansas', 'Frankfort': 'Kentucky', 'Baton Rouge': 'Louisiana', 'Augusta': 'Maine', 'Annapolis': 'Maryland', 'Boston': 'Massachusetts', 'Lansing': 'Michigan', 'St. Paul': 'Minnesota', 'Jackson': 'Mississippi', 'Jefferson City': 'Missouri', 'Helena': 'Montana', 'Lincoln': 'Nebraska', 'Carson City': 'Nevada', 'Concord': 'New Hampshire', 'Trenton': 'New Jersey', 'Santa Fe': 'New Mexico', 'Raleigh': 'North Carolina', 'Bismarck': 'North Dakota', 'Albany': 'New York', 'Columbus': 'Ohio', 'Oklahoma City': 'Oklahoma', 'Salem': 'Oregon', 'Harrisburg': 'Pennsylvania', 'Providence': 'Rhode Island', 'Columbia': 'South Carolina', 'Pierre': 'South Dakota', 'Nashville': 'Tennessee', 'Austin': 'Texas', 'Salt Lake City': 'Utah', 'Montpelier': 'Vermont', 'Richmond': 'Virginia', 'Olympia': 'Washington', 'Charleston': 'West Virginia', 'Madison': 'Wisconsin', 'Cheyenne': 'Wyoming'}

When using csv.DictWriter to write a CSV file, we have to specify the field names in order:

>>> with open('us-state-capitals.csv', mode='w', newline='') as csv_file:
...     writer = csv.DictWriter(csv_file, fieldnames=['state', 'capital'])
...     writer.writeheader()
...     writer.writerows(rows)
...
15

CSV Exercises

Note

If you are using a Windows machine, you might have issues with double-spacing of lines. If so, add newline='' to your file open() statement when opening the files for writing.

Total Air Travel

This is the total_air_travel.py exercise in the modules directory. Create the file total_air_travel.py in the modules sub-directory of the exercises directory. To test it, run python test.py total_air_travel.py from your exercises directory.

To test this manually, use the file expenses.csv.

Given a CSV file containing expenses by category, I’d like you to calculate how much money was spent on the category “Air Travel”.

The file is formatted like this:

Date,Merchant,Cost,Category
1/05/2017,American Airlines,519.25,Air Travel
1/12/2017,Southwest Airlines,298.90,Air Travel
1/17/2017,Mailchimp,19.80,Software
2/01/2017,Zapier,15.00,Software
2/05/2017,Lyft,24.24,Ground Transport
2/06/2017,Hattie Bs,18.13,Food
2/06/2017,Lyft,15.65,Ground Transport

The columns are:

The date of the expense
Merchant
Amount paid
Category

Your program should accept a single CSV file as input and it should print a floating point number representing the sum of the amounts paid for all “Air Travel” category expenses.

To test it manually, cd to the modules directory and run:

$ python total_air_travel.py expenses.csv
818.15

Pipe to Comma

This is the pipe_to_comma.py exercise in the modules directory. Create the file pipe_to_comma.py in the modules sub-directory of the exercises directory. To test it, run python test.py pipe_to_comma.py from your exercises directory.

Write a program pipe_to_comma.py that reads a pipe-delimited CSV file and converts it to a comma-delimited CSV file. The pipe symbol is |.

$ python pipe_to_comma.py input_file.psv output_file.csv

The file output_file will contain the data from input_file, with the delimiter changed to commas.

Ungroup

This is the ungroup.py exercise in the modules directory. Create the file ungroup.py in the modules sub-directory of the exercises directory. To test it, run python test.py ungroup.py from your exercises directory.

Write a program ungroup.py that reads a CSV file that has some empty fields in its first column and fills those fields with the last value from the first column.

If this program is run like this:

$ python ungroup.py input_file.csv output_file.csv

And input_file.csv contains this:

Category,Item,Expense
Swag,rubber ducks,$50
,stickers,$150
,balloons,$40
Food,breakfast,$300
,lunch,$600
,coffee,$130

Then output_file.csv will be written with these contents:

Category,Item,Expense
Swag,rubber ducks,$50
Swag,stickers,$150
Swag,balloons,$40
Food,breakfast,$300
Food,lunch,$600
Food,coffee,$130

Sort By Column

This is the sort_by_column.py exercise in the modules directory. Create the file sort_by_column.py in the modules sub-directory of the exercises directory. To test it, run python test.py sort_by_column.py from your exercises directory.

The program sort_by_column.py takes a filename of a CSV file as input, plus the output file name and a number representing the column to sort. It reads the CSV file, sorts the file by the given column (starting at 0 since we’re computer people), and writes the file to the output file. Assume the CSV file has 1 single header row.

$ python sort_by_column.py in_filename out_filename 0

The file file_name will be modified so the second column is sorted.

Re-order

This is the reorder.py exercise in the modules directory. Create the file reorder.py in the modules sub-directory of the exercises directory. To test it, run python test.py reorder.py from your exercises directory.

Create a program reorder.py that takes an input file and output file names as input on the command line. It should read a CSV file, swap the first and second columns, and write the CSV back out to the new file.

$ python reorder.py input_file_name output_file_name

The file output_file_name will contain the data from input_file_name, with the columns swapped.