Appendix P: pathlib

Traditionally, more time is spent discussing how to read from & write to a file than how to build the right path in the first place.

In practice, since we now typically use conventional formats like json, csv, toml, yaml, etc. – your actual code will probably dedicate more effort to manipulating file paths than calling .readlines and .write on open files.

Review: File I/O

The canonical way to read from or write to a file in Python is to use with open:

# read from a text file that already exists
with open("filename.txt") as f:
    text = f.read()

# open a new file for writing (erases existing contents)
with open("newfile.txt", "w") as f:
    f.write("hello filesystem!\n")

While it is possible to have multiple read/write statements within the block, when using a library like json it is more common to see:

with open("newfile.txt", "w") as f:
    # json.dumps returns a string, which we write to the file
    f.write(json.dumps(data))

# or more succinctly:

with open("newfile.txt", "w") as f:
    # json.dump takes a file handle as its second argument
    json.dump(data, f)

The primary purpose of the with statement is to ensure that if an error occurs within the indented with-block the file will be properly closed.

Whatever kind of text is inside your file, it is now far more common to use a library like json than to manipulate the contents yourself.

This is equally true for binary files, for example when using the popular Pillow library for manipulating images, it takes care of the file I/O itself.

Therefore, for most practical purposes, the complexities of file I/O are handled for you once you’ve managed to locate & open the correct file.

That however, is often easier said than done, many new programmer struggle with paths.

Manipulating Paths

pathlib is a relatively new addition to Python, which accounts for the fact that you’ll still see examples using less effective methods, particularly those from the os and os.path modules.

pathlib makes working with file paths much easier. It will typically be used in tandem with an open call or equivalent.

The primary thing the module contains is a type called Path

The Path class represents a single file path, the path to the file I’m writing these words in for instance might be

Path("/home/james/sites/map-python-data/pathlib/index.qmd")`

While paths may resemble strings, and can be instantiated from them, the Path class offers additional behaviors that are specific to file paths.

.parent

Path objects have a .parent property that is equivalent to going up a directory:

from pathlib import Path

path = Path("/home/user/projects/proj-1")
print(path.parent)
print(path.parent.parent)
/home/user/projects
/home/user

Concatenation

Path objects use / to concatenate parts of a path.

We can use this to build paths out of components:

BASE_DIR = Path("/home/james/sites/map-python-data")

for name in ["pathlib", "web-scraping", "debugging"]:
    # Path overrides the "/" operator to work as concatenation
    # this works with strings and Paths
    file_path = BASE_DIR / name / "index.qmd"
    print(file_path)
/home/james/sites/map-python-data/pathlib/index.qmd
/home/james/sites/map-python-data/web-scraping/index.qmd
/home/james/sites/map-python-data/debugging/index.qmd
Note

This works on Windows as well as Unix-based systems. The path separators will be converted by the library so you can use / and Windows will see \ where appropriate.

Getting the Right Path

If you’ve written a file that works with paths you may have run into issues where it doesn’t always read the correct file.

Perhaps you had code like:

with open("filename.txt") as f:
    f.read()

And found that sometimes it couldn’t find the file in question. Or if writing files, perhaps sometimes it wrote the file to a different directory than the one you expected.

The reason for this is that if a file path does not start with the root / (or C:/ on Windows) it is relative.

These paths will be interpreted as if they begin with the current working directory.

This is an opaque concept, and a perfect example of why we tell you to avoid global variables.

Every running program has a global variable representing the “current working directory”, often the directory it was run from. When you are in your terminal you can see your terminal’s current working directory by typing pwd. Similarly Python has functions to let you examine (os.getcwd) and change (os.chdir) the current working directory.

As you may recall, global variables can make it hard to reason about programs, since any function might modify them in unexpected ways.

# global variables create hard-to-follow code
some_variable = 100
f()
g()
h()
print(some_variable)

What will print? That depends on what f, g, and h do to the global state!

As we’ll see, the key to robust file-handling that works equally well on your system as it does on your peers’ is to generally avoid using this global state altogether.

Absolute Paths

One solution to this problem is to use absolute paths, you may find that instead of open("data/target.json") you can get your code to work when you use open("/home/user/projects/proj-2/data/target.json").

But this path is unique to your computer. On my machine I may need "/home/james/dev/proj2/data/target.json".

How can we do this without constantly dueling edits in our Git repository?

__file__

If we’re concerned about portability we want to have a way to say “the directory next to this one” or “the directory that is a parent of this one”.

Often we’re trying to create a layout like this:

proj-dir/
├── data
│   └── target.json
└── src
    └── script.py

script.py would like to be able to write to data/target.json in a reliable way regardless of what the current working directory is.

We’d like to do this without knowing exactly where proj-dir is as well, since it may be in /Users/james/projects on one machine and /home/stephen/my-homework on another.

To do this, we can define our paths using the relationship between the two files.

The algorithm for doing this is:

  1. Have the Python file get the path to itself.
  2. Determine the relative path from the Python file in question to the data file.
  3. Use pathlib to combine these.

Python has a special variable __file__ that’ll help with step 1, and the rest of the steps we can do with standard path operators:

# assume we're in /home/james/projects/proj-dir/src/script.py
from pathlib import Path

# this creates a Path object that is the full path to script.py
# and then uses .parent to go up one level, to
# "/home/james/projects/proj-dir/src/"
BASE_DIR = Path(__file__).parent

# Combine that path with a relative path from 'src'
# to the file in question.
#  - up one directory, then into the data directory
data_path = BASE_DIR / "../data/target.json"

Forming paths using __file__ makes them consistent as long as the .py files do not move relative to the data.

Using Path objects

Path objects can typically be passed in anywhere a filename is expected, so open("filename.txt", "w") can become open(path_obj, "w"). You can also write this as path.open("w"). (See pathlib.Path.open.)

Path objects also have quite a few helper methods that can make your life easier:

Path.exists

If you want to check if a given file exists, you can construct a path to it and then call .exists:

path = BASE_DIR / "data.csv"
if path.exists():
    read_and_process(path)
else:
    create_initial_data(path)

Path.mkdir

A common pattern is to want to create a directory if it doesn’t exist:

log_directory = BASE_DIR / "logs"
log_directory.mkdir(exist_ok=True, parents=True)

This also demonstrates two useful parameters:

  • exist_ok=True makes it so that the function will not raise an error if the directory already exists.
  • parents=True will also create parent directories if needed.

Quick Reading/Writing

If you are reading/writing the entire file in one go, instead of using the IO object returned by open, you can call read_text and write_text directly on the Path as a shortcut.

p = Path("file.txt")
p.write_text('Text file contents')
p.read_text()
'Text file contents'

Further Exploration

See the official pathlib documentation for more methods and examples.


  1. If you look at the documentation, you’ll see a few related classes like PurePath and PosixPath. You can ignore those differences for the most part and use Path.↩︎