Efficiently listing files in a tree with Python

2020-04-02

Summary

What’s the most efficient way to recursively list all files under a directory in Python? This came up while I was looking in to a data set at work; the data layout that we got from our client was split up into a prefixed directory hierarchy three levels deep; each level had about 50 folders in it and there were about 300K files at the bottom of the hierarchy. (Most of the files were in a folder by themselves, so you’d think they could have truncated that a bit, but oh well.) Here are some different approaches and timings. Everything is running on macOS 10.14.6 running on a MacBook Pro (13-inch, 2017, Two Thunderbolt 3 ports).

Python’s os.walk worked the best in my timings and for others on StackOverflow, and it seems to be about 20% slower than the system find utility.

Timings

You’d figure that for a simple use case like this, the system’s /usr/bin/find utility should be about as fast as you can get, so here are timings for that.

% time find . -name "*.rdf" > /dev/null
find . -name "*.rdf" > /dev/null  2.98s user 37.37s system 75% cpu 53.649 total

% time find . -name "*.rdf" > /dev/null
find . -name "*.rdf" > /dev/null  2.71s user 32.42s system 75% cpu 46.517 total

The handier fd utility uses multiple threads by default, so it does noticably better — but if you limit it to a single thread, it does noticably worse than /usr/bin/find.

% time fd -e rdf > /dev/null
fd -e rdf > /dev/null  13.00s user 66.04s system 241% cpu 32.686 total

% time fd -e rdf > /dev/null           
fd -e rdf > /dev/null  11.86s user 54.77s system 225% cpu 29.573 total

% time fd -j 1 -e rdf > /dev/null      
fd -j 1 -e rdf > /dev/null  10.10s user 43.24s system 62% cpu 1:25.31 total

% time fd -j 1 -e rdf > /dev/null
fd -j 1 -e rdf > /dev/null  11.16s user 52.47s system 63% cpu 1:40.98 total

Based on this StackOverflow question, I tried a few different ways to replicate this in Python.

import os
import glob

# %timeit walk()
# 1min 12s ± 8.86 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def walk():
    pys = []
    for p, d, f in os.walk('.'):
        for file in f:
            if file.endswith('.rdf'):
                pys.append(file)
    return pys

# %timeit walk2()
# 1min 4s ± 2.12 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def walk2():
    return [
        file
        for p, d, files in os.walk(".")
        for file in files
        if file.endswith(".rdf")
    ]

# %timeit globber()
# 1min 24s ± 2.55 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def globber():
    return glob.glob('**/*.rdf', recursive=True)

# %timeit iglob()
# 1min 39s ± 10.9 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def iglob():
    pys = []
    for file in glob.iglob('**/*', recursive=True):
        if file.endswith('.rdf'):
            pys.append(file)
    return pys

# %timeit iglob2()
# 1min 26s ± 5.74 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def iglob2():
    pys = []
    for file in glob.iglob('**/*.rdf', recursive=True):
        pys.append(file)
    return pys