Introduction to Files

An Expected Failure

This program is supposed to count the number of times that People occurs in the declaration-of-independence.txt file:

f = open("declaration-of-independence.txt")
people_count = f.count("People")
print(f"People occurs {people_count} times.")

There’s a bug in our code though!

Try to figure out what’s going on and see if you can fix it.

Hint

Strings have a count method but that f variable doesn’t point to a string.

What type of object does f point to? And what methods and attributes does that f object have?

You may want to use the built-in breakpoint and help functions.
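As a rough sketch of that kind of investigation, we can ask Python what type of object open gives back and whether it has particular methods. The file name example.txt below is a throwaway assumption for illustration, not one of the workshop files:

```python
import os
import tempfile

# Create a small throwaway file so this example is self-contained.
# The name example.txt is hypothetical, not a workshop file.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, mode="wt", encoding="utf-8") as example_file:
    example_file.write("We the People\n")

f = open(path, encoding="utf-8")
type_name = type(f).__name__     # name of the class of the object f points to
has_count = hasattr(f, "count")  # does this object have a count method?
has_read = hasattr(f, "read")    # does this object have a read method?
f.close()

print(type_name, has_count, has_read)  # TextIOWrapper False True
```

Calling help(f) at the Python prompt lists every method and attribute on the object.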

Reading Files

Up to now, the only way we’ve used large amounts of data in our code is by putting it directly into the code.

Let’s learn how to read data from files.

Let’s read from the file declaration-of-independence.txt.

>>> declaration_file = open('declaration-of-independence.txt')
>>> contents = declaration_file.read()
>>> declaration_file.close()
>>> len(contents)
8190

First we open the file, then we read the contents of the file, then we close the file. After that we check the length of the contents we read.

Let’s make a program file_stats.py that will read from a given file and give us statistics on the text in that file.

import sys


def print_file_stats(filename):
    stat_file = open(filename)
    contents = stat_file.read()
    stat_file.close()
    word_count = len(contents.split())
    print(f"Number of Words: {word_count}")

if __name__ == "__main__":
    filename = sys.argv[1]
    print_file_stats(filename)

Let’s try it out:

$ python file_stats.py declaration-of-independence.txt
Number of Words: 1342

It works!

Closing Files

We need to remember to always close our file descriptors. This isn’t as important when reading files, but will be very important when writing files.

We can make sure we always close our files by putting our file read and close in a try-finally block.

The finally block allows us to perform an action regardless of whether an exception occurred. The finally block is useful for cleanup steps, like closing files.

def print_file_stats(filename):
    stat_file = open(filename)
    try:
        contents = stat_file.read()
    finally:
        stat_file.close()
    word_count = len(contents.split())
    print(f"Number of Words: {word_count}")

This will ensure that even if an exception is raised while reading the file, our file descriptor will still be closed.
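To see this behavior in action, here is a minimal sketch that raises an exception on purpose after reading. The file example.txt and the RuntimeError are assumptions made up for the demonstration:

```python
import os
import tempfile

# Throwaway file for the demonstration (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, mode="wt", encoding="utf-8") as example_file:
    example_file.write("some text\n")

stat_file = open(path, encoding="utf-8")
try:
    try:
        stat_file.read()
        # Simulate something going wrong while we work with the file.
        raise RuntimeError("simulated failure")
    finally:
        # This runs whether or not an exception was raised above.
        stat_file.close()
except RuntimeError as error:
    message = str(error)

print(message)           # simulated failure
print(stat_file.closed)  # True
```

The closed attribute on file objects tells us whether the file has been closed.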

This is such a common concern in Python that file objects support a special syntax for it, the with statement.

def print_file_stats(filename):
    with open(filename) as stat_file:
        contents = stat_file.read()
    word_count = len(contents.split())
    print(f"Number of Words: {word_count}")

The file object used in this with block is called a context manager. Context managers allow us to ensure that particular cleanup tasks occur whenever a block of code is exited. Once our with block is exited, the stat_file file descriptor will be closed, even if an exception was raised inside the block.
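A small sketch can confirm when the file gets closed. It checks the closed attribute inside the block, after the block, and after a block that exits because of an exception. The file name and the RuntimeError are assumptions for illustration:

```python
import os
import tempfile

# Throwaway file for the demonstration (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, mode="wt", encoding="utf-8") as example_file:
    example_file.write("some text\n")

with open(path, encoding="utf-8") as stat_file:
    closed_inside = stat_file.closed  # still open inside the block
closed_after = stat_file.closed       # closed once the block is exited

try:
    with open(path, encoding="utf-8") as failing_file:
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
closed_after_error = failing_file.closed  # closed even though an error occurred

print(closed_inside, closed_after, closed_after_error)  # False True True
```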

Once the file descriptor is closed, the file cannot be accessed:

def print_file_stats(filename):
    with open(filename) as stat_file:
        pass
    contents = stat_file.read()
    word_count = len(contents.split())
    print(f"Number of Words: {word_count}")

When you try to run this version of print_file_stats, Python will raise an error because the file is closed:

>>> print_file_stats("declaration-of-independence.txt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in print_file_stats
ValueError: I/O operation on closed file.

I recommend getting into the habit of always using context managers when opening files.

Mode

Files are opened in read text mode by default.

Let’s set our mode explicitly when opening our file to make it clear that we’re reading from it:

def print_file_stats(filename):
    with open(filename, mode='rt') as stat_file:
        contents = stat_file.read()
    word_count = len(contents.split())
    print(f"Number of Words: {word_count}")
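As a quick check that the explicit mode matches the default, this sketch reads the same throwaway file both ways and compares the results. The file name is an assumption for illustration:

```python
import os
import tempfile

# Throwaway file for the demonstration (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, mode="wt") as example_file:
    example_file.write("plain text\n")

with open(path) as default_file:
    default_mode = default_file.mode      # the mode Python picked for us
    default_contents = default_file.read()

with open(path, mode="rt") as explicit_file:
    explicit_contents = explicit_file.read()

print(default_mode)                           # r
print(default_contents == explicit_contents)  # True
```

The t for text is implied when we leave it out, so 'r' and 'rt' behave the same way.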

Writing Files

Let’s make a program that writes to a file.

To write to a file, we need to open it with a w in the mode argument:

>>> with open('test.txt', mode='wt', encoding='utf-8') as test_file:
...     print("Hello world!", file=test_file)
...

We can print straight to a file. This is pretty convenient.

We can also call the write method on file objects to write to them:

>>> with open('test.txt', mode='wt', encoding='utf-8') as test_file:
...     test_file.write("Hello world!\n")
...
13

The write method on our file objects writes every character we give it to the file. It returns the number of characters it wrote to the file.

How is the write method different from using the print function with our file? (hint: look at the newline characters)
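One way to investigate is to write the same text both ways and then look at the repr of what ended up in each file. This is a sketch using two throwaway files whose names are made up for the comparison:

```python
import os
import tempfile

directory = tempfile.mkdtemp()
printed_path = os.path.join(directory, "printed.txt")  # hypothetical names
written_path = os.path.join(directory, "written.txt")

with open(printed_path, mode="wt", encoding="utf-8") as test_file:
    print("Hello world!", file=test_file)

with open(written_path, mode="wt", encoding="utf-8") as test_file:
    test_file.write("Hello world!")

with open(printed_path, mode="rt", encoding="utf-8") as test_file:
    printed_text = test_file.read()

with open(written_path, mode="rt", encoding="utf-8") as test_file:
    written_text = test_file.read()

print(repr(printed_text))  # 'Hello world!\n'
print(repr(written_text))  # 'Hello world!'
```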

Encoding

When we have text saved in a file, there are bytes that represent that text. The way that our files represent text as bytes is called an encoding. There are lots of different encodings. The most popular encoding is utf-8.
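To make the idea of an encoding concrete, here is a small sketch that turns the same string into bytes using two different encodings. The word used is just an example:

```python
# "caf\u00e9" is the word café, written with an escape for the accented e.
text = "caf\u00e9"

utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")

print(utf8_bytes)    # b'caf\xc3\xa9'
print(cp1252_bytes)  # b'caf\xe9'

# Decoding with the same encoding we encoded with gives back the original text.
print(utf8_bytes.decode("utf-8") == text)  # True
```

The accented character takes two bytes in utf-8 and one byte in cp1252, while the plain letters are the same in both.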

On Linux and macOS, Python usually uses utf-8 as its default encoding. On Windows, the default depends on the system locale and is often cp1252.

Let’s make sure our program uses the same encoding on any machine by explicitly specifying it:

def print_file_stats(filename):
    with open(filename, mode='rt', encoding='utf-8') as stat_file:
        contents = stat_file.read()
    word_count = len(contents.split())
    print(f"Number of Words: {word_count}")

Now our file will be read as a UTF-8 text file, regardless of what machine our Python is running on.

This probably didn’t change anything. If we had special characters (like accented characters or emoji) we would notice a difference between UTF-8 and the default encoding when saving our files on Windows machines.
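Here is a sketch of the kind of difference we might see. It saves text with an accented character as utf-8 and then reads it back with two different encodings. The file name is an assumption, and ascii is used when printing so the output looks the same in any terminal:

```python
import os
import tempfile

# Throwaway file for the demonstration (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "accents.txt")

with open(path, mode="wt", encoding="utf-8") as accent_file:
    accent_file.write("caf\u00e9")  # the word café

with open(path, mode="rt", encoding="utf-8") as accent_file:
    as_utf8 = accent_file.read()

with open(path, mode="rt", encoding="cp1252") as accent_file:
    as_cp1252 = accent_file.read()

print(ascii(as_utf8))    # 'caf\xe9'
print(ascii(as_cp1252))  # 'caf\xc3\xa9'
```

Reading with the wrong encoding turned one accented character into two unrelated characters.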

Reading Line-By-Line

If we’re reading very large files, it may not be a good idea to read the whole file all at once because we might fill up our system memory.

File objects can be looped over the same way lists, tuples, and other iterables can be looped over. Looping over a file will read it line-by-line.

>>> csv_file = open('us-state-capitals.csv')
>>> for line in csv_file:
...     print(line)
...
state,capital

Alabama,Montgomery

Alaska,Juneau

Arizona,Phoenix

Arkansas,Little Rock
...

Note that there’s an extra line break between each of these lines. The reason is that when we loop over lines in a file, each line will have a line break character at the end, but when we print text the print function inserts a second line break by default. If we tell print not to insert an extra line break at the end, we’ll see each line as it appears in the file:

>>> csv_file = open('us-state-capitals.csv')
>>> for line in csv_file:
...     print(line, end='')
...
state,capital
Alabama,Montgomery
Alaska,Juneau
Arizona,Phoenix
Arkansas,Little Rock
...

If you’re reading a file that is many megabytes large, you may want to read the file line-by-line instead of reading the whole file at once.
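As a sketch of how line-by-line reading could apply to our word count, the function below adds up the words on each line instead of reading the whole file at once. The function name count_words_by_line and the sample file are assumptions for illustration:

```python
import os
import tempfile


def count_words_by_line(filename):
    """Count the words in a file while holding only one line in memory."""
    word_count = 0
    with open(filename, mode="rt", encoding="utf-8") as stat_file:
        for line in stat_file:
            word_count += len(line.split())
    return word_count


# Throwaway file for the demonstration (hypothetical name and contents).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, mode="wt", encoding="utf-8") as example_file:
    example_file.write("one two three\nfour five\n")

total = count_words_by_line(path)
print(f"Number of Words: {total}")  # Number of Words: 5
```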

File Exercises

Hint

If you get stuck for a minute or more, try searching Google or using help.

Line Numbers

This is the line_numbers.py exercise in the modules directory. Create the file line_numbers.py in the modules sub-directory of the exercises directory. To test it, run python test.py line_numbers.py from your exercises directory.

Write a program that accepts a file as its only argument and prints out the lines in the file with a line number displayed in front of each one.

Example:

If my_file.txt contains:

This file
is two lines long.
No wait, it's three lines long!

Running:

$ python line_numbers.py my_file.txt

Should print out:

1 This file
2 is two lines long.
3 No wait, it's three lines long!

Sort

This is the sort.py exercise in the modules directory. Create the file sort.py in the modules sub-directory of the exercises directory. To test it, run python test.py sort.py from your exercises directory.

Write a program sort.py which takes a file as input and sorts every line in the file (ASCIIbetically). The original file should be overwritten.

Example:

$ python sort.py names.txt

If file names.txt started out as:

John Licea
Freddy Colella
James Stell
Mary Carr
Doris Romito
Janet Allen
Suzanne Blevins
Chris Moczygemba
Shawn McCarty
Jennette Holt

It should end up as:

Chris Moczygemba
Doris Romito
Freddy Colella
James Stell
Janet Allen
Jennette Holt
John Licea
Mary Carr
Shawn McCarty
Suzanne Blevins

Find TODOs

This is the todos.py exercise in the modules directory. Create the file todos.py in the modules sub-directory of the exercises directory. To test it, run python test.py todos.py from your exercises directory.

Write a program that prints out every line in a file that contains the text TODO (I add TODO notes in my files to note to-dos I need to handle). Also print the line number before the line. The line numbers should be padded with zeros so that all the printed numbers are 3 digits long.

Example:

If workshop.rst contains:

This is how you make a list::

    >>> numbers = [1, 2, 3]

.. TODO explain more about what lists are!

.. TODO add section on slicing

This is how you make a tuple::

    >>> numbers = (1, 2, 3)

.. TODO explain more about tuples!

Running:

$ python todos.py workshop.rst

Should print out:

005 .. TODO explain more about what lists are!
007 .. TODO add section on slicing
013 .. TODO explain more about tuples!
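The zero-padding this exercise asks for can be done with a format specification in an f-string. This is only a sketch of the padding step, and the numbers used are made-up examples:

```python
# 03 means: pad with zeros on the left until the number is 3 characters wide.
padded = [f"{number:03}" for number in (5, 7, 13)]

for text in padded:
    print(text)
# 005
# 007
# 013
```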

New README

Create a program new_readme.py in the modules sub-directory of the exercises directory. To test it, run python test.py new_readme.py from your exercises directory.

Create a program that will prompt the user to enter a project name and then will create a readme.md file (in the current working directory) which includes that project name.

Here’s an example of running this program:

$ cd modules
$ python new_readme.py
Name: Skynet Lite

This should create a readme.md file that contains the project name anywhere within it:

# Skynet Lite

TODO

Make sure the file you create is called readme.md (all lowercase) as many file systems are case sensitive.

Hint

Use the input() function to prompt for user input and write to a file using the open() function with write mode.

Passphrase

This is the passphrase.py exercise in the modules directory. Create the file passphrase.py in the modules sub-directory of the exercises directory. To test it, run python test.py passphrase.py from your exercises directory.

Write a program passphrase.py that randomly generates 4-word passphrases.

When you run the program manually, you’ll need to specify a word list file that has one word on each line.

Running this program with a word list file will print out 4 randomly chosen words:

$ python3 passphrase.py words.txt
lisa streets j rocket
$ python3 passphrase.py words.txt
salvador christians vacuum microwave
$ python3 passphrase.py words.txt
newfoundland pendant pan asus

Note

We will talk about Python’s random module in a future lesson, but here is a quick hint on how to use the random.choice function to get our random words for this exercise. The choice function accepts a sequence (a list-like object) and returns a random item from that sequence:

>>> from random import choice
>>> colors = ["purple", "blue", "green", "orange"]
>>> choice(colors)
'purple'
>>> choice(colors)
'orange'
>>> choice(colors)
'orange'

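If you’d rather pick an index yourself, you can take the floating point number returned by random.random (a number from 0 up to, but not including, 1), multiply it by the number of words in the word file, make that an int, and use that as your index.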
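The index approach can be sketched like this, assuming the floating point number comes from the random function in the random module, which returns a number from 0 up to, but not including, 1. The colors list stands in for the words from a word list file:

```python
from random import random

colors = ["purple", "blue", "green", "orange"]

# Scale the random float to the length of the list, then drop the fraction.
# Because random() is always less than 1, the index is always in range.
index = int(random() * len(colors))
color = colors[index]

print(index, color)
```

This does the same job as choice(colors), just with more steps.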

Count

This is the count.py exercise in the modules directory. Create the file count.py in the modules sub-directory of the exercises directory. To test it, run python test.py count.py from your exercises directory.

Write a program that accepts a file as an argument and outputs the number of lines, words, and characters in the file. In addition, it outputs the number of characters of the longest line in the file. Note the number of characters is for the whole file, which includes the newline character line endings. For the longest line, we do not want to include the line endings; only the actual number of characters of the line.

$ python count.py my_file.txt
Lines: 2
Words: 6
Characters: 28
Longest line: 17
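The difference between the two character counts comes down to whether the newline at the end of each line is included. Here is a sketch of just that distinction, using a made-up list of lines rather than a real file:

```python
# Lines as they would come out of looping over a file (hypothetical contents).
lines = ["This file\n", "is two lines long\n"]

# Characters in the whole file: every character counts, newlines included.
character_count = sum(len(line) for line in lines)

# Longest line: strip the trailing newline before measuring.
longest_line = max(len(line.rstrip("\n")) for line in lines)

print(f"Characters: {character_count}")  # Characters: 28
print(f"Longest line: {longest_line}")   # Longest line: 17
```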

Reverse

This is the reverse.py exercise in the modules directory. Create the file reverse.py in the modules sub-directory of the exercises directory. To test it, run python test.py reverse.py from your exercises directory.

Write a program that reverses a file character-by-character and outputs the newly reversed text into a new file.

Example:

If my_file.txt contains:

This file
is two lines long

Running:

$ python reverse.py my_file.txt elif_ym.txt

Should make elif_ym.txt contain:

gnol senil owt si
elif sihT

Hint: review some of the interesting ways that slicing works.

Concatenate

This is the concat.py exercise in the modules directory. Create the file concat.py in the modules sub-directory of the exercises directory. To test it, run python test.py concat.py from your exercises directory.

Write a program concat.py that takes any number of files as command-line arguments and sticks the files together, printing them to standard output.

If an error occurs while reading a file, the file should be skipped and an error should be printed.

Print the error messages to standard error (not standard output).

Tip

You can print to standard error like this:

>>> import sys
>>> print("this is an error", file=sys.stderr)
this is an error

Example usage of concat.py:

$ python concat.py file1.txt file2.txt file3.txt
This is file 1
[Errno 2] No such file or directory: 'file2.txt'
This is file 3