2020-04-02
What’s the most efficient way to recursively list all files under a directory in Python? This came up while I was looking in to a data set at work; the data layout that we got from our client was split up into a prefixed directory hierarchy three levels deep; each level had about 50 folders in it and there were about 300K files at the bottom of the hierarchy. (Most of the files were in a folder by themselves, so you’d think they could have truncated that a bit, but oh well.) Here are some different approaches and timings. Everything is running on macOS 10.14.6 running on a MacBook Pro (13-inch, 2017, Two Thunderbolt 3 ports).
Python’s os.walk worked the best in my timings and for others on StackOverflow, and it seems to be about 20% slower than the system find utility.
You’d figure that for a simple use case like this, the system’s /usr/bin/find utility should be about as fast as you can get, so here are timings for that.
% time find . -name "*.rdf" > /dev/null
find . -name "*.rdf" > /dev/null 2.98s user 37.37s system 75% cpu 53.649 total
% time find . -name "*.rdf" > /dev/null
find . -name "*.rdf" > /dev/null 2.71s user 32.42s system 75% cpu 46.517 totalThe handier fd utility uses multiple threads by default, so it does noticably better — but if you limit it to a single thread, it does noticably worse than /usr/bin/find.
% time fd -e rdf > /dev/null
fd -e rdf > /dev/null 13.00s user 66.04s system 241% cpu 32.686 total
% time fd -e rdf > /dev/null
fd -e rdf > /dev/null 11.86s user 54.77s system 225% cpu 29.573 total
% time fd -j 1 -e rdf > /dev/null
fd -j 1 -e rdf > /dev/null 10.10s user 43.24s system 62% cpu 1:25.31 total
% time fd -j 1 -e rdf > /dev/null
fd -j 1 -e rdf > /dev/null 11.16s user 52.47s system 63% cpu 1:40.98 totalBased on this StackOverflow question, I tried a few different ways to replicate this in Python.
import os
import glob
# %timeit walk()
# 1min 12s ± 8.86 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def walk():
pys = []
for p, d, f in os.walk('.'):
for file in f:
if file.endswith('.rdf'):
pys.append(file)
return pys
# %timeit walk2()
# 1min 4s ± 2.12 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def walk2():
return [
file
for p, d, files in os.walk(".")
for file in files
if file.endswith(".rdf")
]
# %timeit globber()
# 1min 24s ± 2.55 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def globber():
return glob.glob('**/*.rdf', recursive=True)
# %timeit iglob()
# 1min 39s ± 10.9 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def iglob():
pys = []
for file in glob.iglob('**/*', recursive=True):
if file.endswith('.rdf'):
pys.append(file)
return pys
# %timeit iglob2()
# 1min 26s ± 5.74 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def iglob2():
pys = []
for file in glob.iglob('**/*.rdf', recursive=True):
pys.append(file)
return pys