If you want to use Python to work with the directory tree or files on your system, there are many tools to help, including Python's standard os module. The following is a simple recipe for finding files on your system by file extension.
If you have ever "lost" a file on your system, where you can't remember its location and aren't even sure of its name but do remember its type, this recipe may be useful.
In a way this recipe is a combination of How to Traverse a Directory Tree and Recursive Directory Traversal in Python: Make a list of your movies!, but we'll tweak it a bit and build upon it in part two.
To script this task, we can use the walk function in the os.path module (Python 2.x) or the walk function in the os module (Python 3.x). Note that os.path.walk was removed in Python 3, while os.walk is also available in Python 2.x.
Recursion with os.path.walk in Python 2.x
The os.path.walk function takes 3 arguments, in this order:
top - the top of the directory tree to walk.
visit - a function to execute upon each iteration.
arg - an arbitrary (but mandatory) argument.
It then walks through the directory tree under top, calling the visit function at every step. Let's examine the function (which we'll define as "step") that we use to print the path names of the files under top that have the file extension we provide through arg.
Here is the definition of step:
[python]
def step(ext, dirname, names):
    ext = ext.lower()
    for name in names:
        if name.lower().endswith(ext):
            print(os.path.join(dirname, name))
[/python]
Now let's break it down line by line, but first it's very important to point out that the arguments given to step are passed directly by the os.path.walk function, not by the user. The three arguments that walk passes on each iteration are:
ext - the arbitrary argument given to os.path.walk.
dirname - the directory name for that iteration.
names - the names of all entries (files and subdirectories) in dirname.
The first line of our step function is of course the function declaration, which lists the three parameters that os.path.walk passes in on each call.
The second line ensures our ext string is lowercase. The third line begins our loop over the names argument, which is a list. The fourth line is how we pick out the names of files with the extension we want, using the string method endswith to test for a suffix.
The final line prints the path of any file that passes the suffix (extension) test, joining the dirname argument to the name (with the appropriate system-dependent separator).
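The suffix test at the heart of step can be tried on its own, without walking any directories. The sketch below is only an illustration: the helper name matching_names and the sample file names are made up for this example and are not part of the recipe.

```python
def matching_names(names, ext):
    """Return the names that end with ext, ignoring case."""
    ext = ext.lower()
    return [name for name in names if name.lower().endswith(ext)]


# Hypothetical directory listing, similar to what walk passes in as names
sample = ['README.TXT', 'notes.txt', 'setup.py', 'archive.txt.bak']

# Only names whose final characters are '.txt' (in any case) are kept
print(matching_names(sample, '.txt'))
```

Because endswith only looks at the end of the string, a name such as archive.txt.bak is not treated as a .txt file.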
Now after combining our step function with the walk function, the script looks something like this:
[python]
# We only need to import this module
import os.path

# The top argument for walk. The
# Python27/Lib/site-packages folder in my case
topdir = '.'

# The arg argument for walk, and subsequently ext for step
exten = '.txt'

def step(ext, dirname, names):
    ext = ext.lower()
    for name in names:
        if name.lower().endswith(ext):
            print(os.path.join(dirname, name))

# Start the walk
os.path.walk(topdir, step, exten)
[/python]
On my system, wx_py is installed in the site-packages folder for Python 2.7, so the output looks like this:
[shell]
.\README.txt
.\wx-2.8-msw-unicode\docs\CHANGES.txt
.\wx-2.8-msw-unicode\docs\MigrationGuide.txt
.\wx-2.8-msw-unicode\docs\README.win32.txt
......
.\wx-2.8-msw-unicode\wx\tools\XRCed\TODO.txt
[/shell]
Recursion with os.walk in Python 3.x
Now let's do the same using Python 3.x.
The os.walk function in Python 3.x works differently and provides a few more options. It takes 4 arguments, of which only the first is mandatory. The arguments (and their default values), in order, are:
top - the root of the directory to walk.
topdown(=True) - boolean designating top-down or bottom-up walking.
onerror(=None) - name of a function to call if an error occurs.
followlinks(=False) - boolean designating whether or not to follow symbolic links.
The only one we are concerned with for now is the first. Aside from the arguments, perhaps the biggest difference between the two versions of the walk function is that the Python 2.x version iterates over the directory tree automatically, while the Python 3.x version returns a generator. This means that the Python 3.x version only advances to the next directory when we ask it to, and the way we will do that is with a loop.
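To see this step-by-step behavior directly, here is a small sketch that advances os.walk by hand with next() instead of a loop. It builds a throwaway directory with tempfile so that it does not depend on any real folder; the names sub and a.txt are made up for the illustration.

```python
import os
import tempfile

# Build a tiny temporary tree: one subfolder and one file
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, 'sub'))
    open(os.path.join(top, 'a.txt'), 'w').close()

    # Calling os.walk only creates the generator
    walker = os.walk(top)

    # Each next() call advances the walk by one directory
    dirpath, dirnames, files = next(walker)
    first_step = (dirpath == top, dirnames, files)

    # A loop (or list) simply keeps asking until the walk is finished
    remaining = list(walker)

print(first_step)
print(len(remaining))
```

A for loop over os.walk does the same thing as the repeated next() calls, which is why the script only needs a loop.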
Instead of defining a separate function to call, as with step, we will write the os.walk generator into the loop that went into the step function. Where the Python 2.x version passed its values as separate arguments, os.walk yields 3 values on every iteration (the directory path, the directory names, and the filenames) in the form of a 3-tuple, so we have to adjust our method accordingly. Other than that we won't change the extension suffix test at all, so the script ends up looking something like this:
[python]
import os

# The top argument for walk
topdir = '.'

# The extension to search for
exten = '.txt'

for dirpath, dirnames, files in os.walk(topdir):
    for name in files:
        if name.lower().endswith(exten):
            print(os.path.join(dirpath, name))
[/python]
Because my system's Python32/Lib/site-packages folder contains nothing special, the output for this one ends up being just:
[shell]
.\README.txt
[/shell]
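If the search is needed in more than one place, the same loop can be wrapped in a function that returns the matches instead of printing them. This is only a sketch under that assumption; the name find_by_extension is hypothetical and is not something the os module provides.

```python
import os


def find_by_extension(top, ext):
    """Return the paths under top whose file names end with ext (case-insensitive)."""
    ext = ext.lower()
    found = []
    for dirpath, dirnames, files in os.walk(top):
        for name in files:
            if name.lower().endswith(ext):
                found.append(os.path.join(dirpath, name))
    return found
```

Calling find_by_extension('.', '.txt') should give the same paths that the script prints, but as a list we can count, sort, or pass along.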
The script will work the same way for whatever the "topdir" and "exten" strings are set to; however, it simply prints the file paths to the window (in our examples the Python IDLE window), and if there are many files to print, this leaves our interpreter (or shell) window many rows high, which is a pain to scroll through. In that case it would be much easier to write the results to a text file we can look at anytime. We can do so easily if we incorporate a with statement (as in Reading and Writing Files in Python) like so:
[python]
with open(logpath, 'a') as logfile:
    logfile.write('%s\n' % os.path.join(dirname, name))
[/python]
Let's first see how to incorporate it into the Python 2.x version of the script:
[python]
# We only need to import this module
import os.path

# The top argument for walk. The
# Python27/Lib/site-packages folder in my case.
topdir = '.'

# The arg argument for walk, and subsequently ext for step
exten = '.txt'

logname = 'findfiletype.log'

def step((ext, logpath), dirname, names):
    ext = ext.lower()
    for name in names:
        if name.lower().endswith(ext):
            # Instead of printing, open up the log file for appending
            with open(logpath, 'a') as logfile:
                logfile.write('%s\n' % os.path.join(dirname, name))

# Change the arg to a tuple containing the file
# extension and the log file name. Start the walk.
os.path.walk(topdir, step, (exten, logname))
[/python]
As we can see above, not much has changed except for the third variable, logname, and the third argument to os.path.walk, which is now a tuple. The with statement has replaced the print call. Because of the nature of the os.path.walk function, step is required to open the log file, write to it, and close it every time it finds a matching file name; this won't cause any errors but is a bit awkward. We must also note that because the log file is opened in append mode, it will not overwrite a log file that already exists; it will only append to it. This means that if we run the script 2 or more times in a row without changing logname, the results of each run will be added to the same file, which may not be desirable.
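The effect of append mode across repeated runs can be checked with a short experiment. The sketch below writes to a log inside a temporary folder rather than a real one, and the single line found.txt stands in for whatever a run would actually find; both are assumptions made for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    logpath = os.path.join(tmp, 'findfiletype.log')

    # Two "runs" in append mode: the second adds to what the first wrote
    for run in range(2):
        with open(logpath, 'a') as logfile:
            logfile.write('found.txt\n')
    with open(logpath) as logfile:
        appended = logfile.read().splitlines()

    # One "run" in write mode: the earlier contents are discarded
    with open(logpath, 'w') as logfile:
        logfile.write('found.txt\n')
    with open(logpath) as logfile:
        overwritten = logfile.read().splitlines()

print(len(appended), len(overwritten))
```

If accumulated results are not wanted, one option is to delete or rename the old log before starting a new run.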
The modified Python 3.x version of the script is much less awkward:
[python]
import os

# The top argument for walk
topdir = '.'

# The extension to search for
exten = '.txt'

logname = 'findfiletype.log'

# What will be logged
results = str()

for dirpath, dirnames, files in os.walk(topdir):
    for name in files:
        if name.lower().endswith(exten):
            # Save to results string instead of printing
            results += '%s\n' % os.path.join(dirpath, name)

# Write results to logfile
with open(logname, 'w') as logfile:
    logfile.write(results)
[/python]
In this version the path of each found file is appended to the results string, and then when the search is over, the results are written to the log file. Unlike the Python 2.x version, the log file is opened in write mode, meaning any existing log file will be overwritten. In both cases the log file will be written to the current working directory, which is the script's own directory when the script is run from there (because we didn't specify a full path name).
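To check where a log with a relative name will end up, or to send it somewhere specific, the path can be built explicitly. This is a minimal sketch; using the home folder as the destination is just an assumption for the example, and any folder we can write to would work.

```python
import os

logname = 'findfiletype.log'

# Where a relative name resolves: the current working directory
relative_target = os.path.abspath(logname)

# Hypothetical alternative: place the log in the user's home folder
log_dir = os.path.expanduser('~')
explicit_target = os.path.join(log_dir, logname)

print(relative_target)
print(explicit_target)
```

Passing explicit_target to open in place of logname would write the log to that folder regardless of where the script is started.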
With that we have a simple script to find files with a certain extension under a directory tree and log the results. In the parts that follow we'll build upon this, adding functionality to search for multiple file types, avoid certain paths, and more.