Fix CSV

Starting file: fix_csv.py.

Tests file: test_fix_csv.py.

This program takes a file that uses the pipe character (|) as a delimiter between values and converts it to a comma-separated file.
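For example, given a pipe-delimited file, running the program with an old and a new filename should produce a comma-delimited copy (the filenames and data here are made up for illustration):

$ cat cars.csv
2012|Lexus|LFA
2009|GMC|Yukon

$ python fix_csv.py cars.csv cars-fixed.csv

$ cat cars-fixed.csv
2012,Lexus,LFA
2009,GMC,Yukon

Here’s the solution we’ll start from: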

import sys


old_filename = sys.argv[1]
new_filename = sys.argv[2]

old_file = open(old_filename)
rows = [
    line.split('|')
    for line in old_file.read().splitlines()
]

new_file = open(new_filename, mode='wt', newline='\r\n')
print("\n".join(
    ",".join(row)
    for row in rows
), file=new_file)
old_file.close()
new_file.close()

We don’t want to see file close calls in our code. It’s usually better to either exit the process and let the files close automatically or use a context manager to open files so they’ll close when our with block is exited:

import sys


old_filename = sys.argv[1]
new_filename = sys.argv[2]

with open(old_filename) as old_file:
    rows = [
        line.split('|')
        for line in old_file.read().splitlines()
    ]

with open(new_filename, mode='wt', newline='\r\n') as new_file:
    print("\n".join(
        ",".join(row)
        for row in rows
    ), file=new_file)

What is all that splitting and joining we’re doing? What are we trying to do here?

We’re trying to parse a delimited file and then generate another delimited file.

We can use the csv module in the Python standard library for this, passing its delimiter keyword argument:

import csv
import sys


old_filename = sys.argv[1]
new_filename = sys.argv[2]

with open(old_filename, newline='') as old_file:
    reader = csv.reader(old_file, delimiter='|')
    rows = [row for row in reader]

with open(new_filename, mode='wt', newline='') as new_file:
    writer = csv.writer(new_file)
    writer.writerows(rows)

In fact, this passes one of our tests that was previously failing. Our CSV file now properly handles data cells with commas in them.
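To see why, consider a row where one cell contains a comma (a made-up example, using an in-memory file for brevity). The csv module quotes that cell on output, which our manual ",".join approach never did:

import csv
import io

buffer = io.StringIO()
csv.writer(buffer).writerow(['2012', 'Fiat', '500', 'Hatchback, sporty'])
print(repr(buffer.getvalue()))
# Prints: '2012,Fiat,500,"Hatchback, sporty"\r\n'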

Note that the newline='' part of our file opens is what makes this code work properly when writing CSV files on Windows. The Python documentation notes that we should open files with newline='' when working with csv.writer.
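Here’s a sketch of the bug that newline='' avoids (the filename is hypothetical, and the output shown is what you’d see on Windows):

import csv

# csv.writer ends each row with '\r\n'. Without newline='', the text
# layer on Windows translates that '\n' to '\r\n' again, so each row
# ends up terminated by '\r\r\n'.
with open('broken.csv', mode='wt') as new_file:  # no newline=''
    csv.writer(new_file).writerow(['2012', 'Lexus', 'LFA'])

with open('broken.csv', mode='rb') as new_file:
    print(new_file.read())  # on Windows: b'2012,Lexus,LFA\r\r\n'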

That list comprehension we’re using to build our list of rows is unnecessary. We’re just looping over an iterable and making a new list out of it. We can use the list constructor for that:

import csv
import sys


old_filename = sys.argv[1]
new_filename = sys.argv[2]

with open(old_filename, newline='') as old_file:
    reader = csv.reader(old_file, delimiter='|')
    rows = list(reader)

with open(new_filename, mode='wt', newline='') as new_file:
    writer = csv.writer(new_file, delimiter=',')
    writer.writerows(rows)

Is there any way to get rid of those hard-coded index lookups on sys.argv?

There is! We could use slicing and multiple assignment (tuple unpacking) to make it clearer exactly what we’re expecting from our command-line arguments. As a bonus, this will give us an error if too many arguments are supplied, which isn’t a case we were handling before.

import csv
import sys


old_filename, new_filename = sys.argv[1:]

with open(old_filename, newline='') as old_file:
    reader = csv.reader(old_file, delimiter='|')
    rows = list(reader)

with open(new_filename, mode='wt', newline='') as new_file:
    writer = csv.writer(new_file, delimiter=',')
    writer.writerows(rows)
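For example, here’s what unpacking does when three arguments are supplied instead of two (a quick demonstration at the Python REPL):

>>> old_filename, new_filename = ['old.csv', 'new.csv', 'extra.csv']
Traceback (most recent call last):
  ...
ValueError: too many values to unpack (expected 2)

Too few arguments raise a similar ValueError (“not enough values to unpack”).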

If we wanted to make this code even shorter, we could do this:

import csv
import sys


old_filename, new_filename = sys.argv[1:]

with open(old_filename, newline='') as old_file:
    rows = list(csv.reader(old_file, delimiter='|'))

with open(new_filename, mode='wt', newline='') as new_file:
    csv.writer(new_file).writerows(rows)

Do we need to store our rows in a list in order to write them out?

We don’t. We could keep both files open at once and pass our reader iterable directly to our writer’s writerows method:

import csv
import sys


old_filename, new_filename = sys.argv[1:]

with open(old_filename, newline='') as old_file:
    reader = csv.reader(old_file, delimiter='|')
    with open(new_filename, mode='wt', newline='') as new_file:
        csv.writer(new_file).writerows(reader)
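As an aside, nesting the two with blocks is a style choice. The two open calls could equivalently share a single with statement:

import csv
import sys


old_filename, new_filename = sys.argv[1:]

with open(old_filename, newline='') as old_file, \
        open(new_filename, mode='wt', newline='') as new_file:
    reader = csv.reader(old_file, delimiter='|')
    csv.writer(new_file).writerows(reader)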

What’s a more robust way to handle command-line arguments in Python?

The argparse module is the most robust way that’s built into the Python standard library:

from argparse import ArgumentParser
import csv


parser = ArgumentParser()
parser.add_argument('old_filename')
parser.add_argument('new_filename')
args = parser.parse_args()

with open(args.old_filename, newline='') as old_file:
    reader = csv.reader(old_file, delimiter='|')
    with open(args.new_filename, mode='wt', newline='') as new_file:
        csv.writer(new_file).writerows(reader)
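A nice side effect: when the wrong number of arguments is supplied, argparse prints a usage message and exits instead of raising a traceback. The output looks something like this (exact wording may vary between Python versions):

$ python fix_csv.py old.csv new.csv extra.csv
usage: fix_csv.py [-h] old_filename new_filename
fix_csv.py: error: unrecognized arguments: extra.csv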

This variation is neat because argparse opens the files itself, but unfortunately it won’t work for our needs on Windows:

from argparse import ArgumentParser, FileType
import csv


parser = ArgumentParser()
parser.add_argument('old_file', type=FileType('rt'))
parser.add_argument('new_file', type=FileType('wt'))
args = parser.parse_args()

reader = csv.reader(args.old_file, delimiter='|')
csv.writer(args.new_file).writerows(reader)

The reason this doesn’t work for us on Windows is that FileType calls open for us, so we have no way to pass that newline='' option. FileType is a good option in other situations, but for our CSV manipulation we’ll stick with the previous version.

Tests

You can run the tests for this exercise with:

$ python test_fix_csv.py

The test file includes several test cases to verify the CSV conversion works correctly:

  • test_pipe_file_to_csv_file: Tests that pipe-delimited files are correctly converted to comma-separated format

  • test_original_file_is_unchanged: Ensures the original file remains unmodified after conversion

  • test_delimiter_in_output: Tests proper handling of commas within data fields (initially failing - requires using the csv module)

  • test_call_with_too_many_files: Tests error handling for incorrect command-line arguments (initially failing - requires better argument parsing)

The tests demonstrate the progression from a basic string manipulation approach to a robust solution using Python’s built-in csv module. Some tests are marked with @unittest.expectedFailure because they test functionality that isn’t implemented in the initial version but becomes available as you refactor the code to use proper CSV handling.
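For reference, a test marked that way looks roughly like this (a simplified sketch, not the actual contents of test_fix_csv.py):

import unittest


class FixCSVTests(unittest.TestCase):

    @unittest.expectedFailure
    def test_delimiter_in_output(self):
        # Expected to fail until the solution uses the csv module,
        # which quotes cells that contain commas.
        ...


if __name__ == "__main__":
    unittest.main()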