In Part 1 we looked at how to use the os.path.walk and os.walk functions to find and list files of a certain extension under a directory tree. The former is only present in Python 2.x, while the latter is available in both Python 2.x and Python 3.x. As we saw in the previous article, os.path.walk can be awkward to use, so from now on we'll stick to os.walk; this way the script will be simpler and compatible with both branches.
In Part 1 our script traversed all the folders under the path in the topdir variable, but only found files of one extension. Let's now expand that to find files of multiple extensions in select folders under the topdir path. We'll first search for files of three different file extensions: .txt, .pdf, and .doc. Our extens variable will be a list of strings instead of one:
[python]
extens = ['txt', 'pdf', 'doc']
[/python]
The . character is not included in these strings, as it was in the ext variable before; we'll see why shortly. To save the results (the file names) we'll use a dictionary with the extensions as keys:
[python]
# Dict comprehension: one empty list per extension
found = { x: [] for x in extens }
[/python]
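The comprehension matters because every key needs its own list. As a minimal sketch (the `shared` dictionary is only here for comparison and is not part of our script), this is what happens if the same result is attempted with dict.fromkeys and a mutable default. Note that dict comprehensions require Python 2.7 or later.

```python
extens = ['txt', 'pdf', 'doc']

# Dict comprehension: a fresh list is created for every key
found = {x: [] for x in extens}
found['txt'].append('a.txt')

# dict.fromkeys with a mutable default: every key refers to ONE shared list
shared = dict.fromkeys(extens, [])
shared['txt'].append('a.txt')

print(found['pdf'])   # the 'pdf' list is untouched
print(shared['pdf'])  # the 'pdf' list also received 'a.txt'
```

With the shared list, a file found under one extension would show up under all of them, which is why the script builds the dictionary with a comprehension.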
The other variables will remain the same for now; however, the script file itself will be placed in (and will execute from) my system's “Documents” folder, so the topdir variable will become that path.
Previously we tested for the extension with the str.endswith method. If we were to use it again we'd have to test every file name against each extension in the list (or pass endswith a tuple of suffixes), but instead we'll use a slightly different approach. For each file stepped on during the walk we'll extract the extension and then test for membership in extens. Here's how we'll extract it:
[python]
for name in files:
    # Split the name by '.' & get the last element
    ext = name.lower().rsplit('.', 1)[-1]
[/python]
As in the previous part, we put this line inside the for loop that iterates over the files list returned by os.walk. With this line we combine three operations: changing the case of the file name, splitting it, and extracting an element. Calling str.lower on the file name changes it to lowercase, the same case as all the strings in extens. Calling str.rsplit on the result then splits the string into a list (from the right), with the first argument (.) as the delimiter and making at most as many splits as the second argument (1). The third part ([-1]) retrieves the last element of the list. We use this instead of an index of 1 because if no splits are made (if there is no . in name), the list has only one element and an index of 1 would raise an IndexError.
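To see how this extraction behaves on a few edge cases, here is a small sketch. The helper name get_ext and the sample file names are made up for illustration; os.path.splitext is shown next to it only as a standard-library comparison, not as what our script uses.

```python
import os

def get_ext(name):
    """Return the lowercased text after the last '.' in name."""
    return name.lower().rsplit('.', 1)[-1]

# Hypothetical file names covering the common edge cases
samples = ['Report.PDF', 'archive.tar.gz', 'README', '.bashrc']

for name in samples:
    # Compare our approach with os.path.splitext, which keeps the dot
    print('%r -> %r (splitext: %r)' % (name, get_ext(name), os.path.splitext(name)[1]))
```

One thing worth noticing: for a name with no . at all, the whole lowercased name comes back, so a file literally named "txt" would count as a match. os.path.splitext returns an empty string in that case, at the cost of keeping the leading dot and the original case.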
Now that we've extracted the extension of name (if any), we can test to see if it's in our list of extensions:
[python]
if ext in extens:
[/python]
This is why . doesn't precede any of the extension names in extens: ext won't ever have one. If the condition is true, we'll add the name to our found dictionary:
[python]
if ext in extens:
    found[ext].append(os.path.join(dirpath, name))
[/python]
The above line will append the result path (dirpath joined to name, both returned from os.walk) to the list at the ext key in found. Now that we have changed the search extensions and the list of results, we also have to adjust how we save the results to our log file.
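Putting the extraction, the membership test, and the append together, the search step could be sketched as a single function. The name find_by_extension is an assumption made for this example; the script in this article keeps everything at module level instead.

```python
import os

def find_by_extension(topdir, extens):
    """Walk topdir and group matching file paths by extension."""
    found = {x: [] for x in extens}
    for dirpath, dirnames, files in os.walk(topdir):
        for name in files:
            # Lowercase, split once from the right, take the last piece
            ext = name.lower().rsplit('.', 1)[-1]
            # Testing against the dict keys is equivalent to 'ext in extens'
            if ext in found:
                found[ext].append(os.path.join(dirpath, name))
    return found

if __name__ == '__main__':
    results = find_by_extension('.', ['txt', 'pdf', 'doc'])
    for ext, paths in results.items():
        print('%s: %d file(s)' % (ext, len(paths)))
```

Wrapping the walk in a function is only a matter of taste here, but it makes it easy to try different extension lists without editing the loop.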
In the previous version (using os.walk) we simply opened a file at logname and wrote the results to the file. In this version we must loop through multiple categories in the results, one for each extension. We'll concatenate each result list in found to our results string, which we'll now identify as logbody. We'll also add a small header to the logfile, loghead:
[python]
# The header in our logfile
loghead = 'Search log from filefind for files in {}\n\n'.format(os.path.realpath(topdir))
# The body of our log file
logbody = ''
# Loop through results
for search in found:
    # Concatenate the result from the found dict
    logbody += "<< Results with the extension '%s' >>" % search
    # Use str.join to turn the list at search into a str
    logbody += '\n\n%s\n\n' % '\n'.join(found[search])
[/python]
The format of the results can be whatever you like, but it is important that we loop through all of the results to get the full log. After the logbody is complete, we can write our log file:
[python]
# Write results to the logfile
with open(logname, 'w') as logfile:
    logfile.write('%s\n%s' % (loghead, logbody))
[/python]
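Note: if any names/paths in the results contain non-ASCII characters, writing them with the platform's default encoding may fail. In Python 3.x we can pass encoding='utf-8' to open; in Python 2.x we would instead change the open mode to wb and encode loghead and logbody (if they are unicode strings) in order to save the logfile successfully.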
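One way to sidestep the encoding question on both branches is io.open, which accepts an encoding argument in Python 2.7 as well as 3.x. The sketch below is an assumption-laden variant, not the script's actual code: write_log is a hypothetical helper, and it assumes loghead and logbody are text (unicode) strings. On Python 2 that means starting the walk from a unicode topdir such as u'.', so that os.walk returns unicode names.

```python
import io

def write_log(logname, loghead, logbody):
    """Write the header and body as UTF-8 text.

    Assumes both arguments are text (unicode) strings.
    """
    text = u'%s\n%s' % (loghead, logbody)
    # io.open takes an explicit encoding on Python 2.7 and 3.x alike
    with io.open(logname, 'w', encoding='utf-8') as logfile:
        logfile.write(text)
```

Choosing UTF-8 here is a judgment call; any encoding that can represent the file names on your system would do, as long as the log is read back with the same one.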
Now we are finally ready to test our script. Running it on my system yields this log file (shortened):
[shell]
Search log from filefind for files in C:\Python27\Lib\site-packages
<< Results with the extension 'pdf' >>
.\GPL_Full.pdf
.\beautifulsoup4-4.1.3\doc\rfc2425-v2.1.pdf
.\beautifulsoup4-4.1.3\doc\rfc2426-v3.0.pdf
<< Results with the extension 'txt' >>
.\README.txt
.\soup.txt
.\beautifulsoup4-4.1.3\AUTHORS.txt
.\beautifulsoup4-4.1.3\COPYING.txt
...
.\wx-2.8-msw-unicode\docs\CHANGES.txt
.\wx-2.8-msw-unicode\docs\MigrationGuide.txt
.\wx-2.8-msw-unicode\docs\README.win32.txt
...
.\wx-2.8-msw-unicode\wx\tools\XRCed\TODO.txt
<< Results with the extension 'doc' >>
[/shell]
This log tells us that in the C:\Python27\Lib\site-packages directory there are a few PDF files, many text files, and no ".doc" or Word files. It seems to work fine, and the extension search list can be changed easily, but what if we don't want to search in the "docs" directory under the wx-2.8-msw-unicode tree? After all, we know there will probably be lots of text files in there. We can ignore this directory by modifying the dirnames list in-place in the main walk loop. Because we might want to ignore more than one directory, we'll keep a list of them (this will come before the loop of course):
[python]
# Directories to ignore
ignore = ['docs', 'doc']
[/python]
Now that we have the list, we'll add this small loop inside the main walk loop (and before the loop over the file names):
[python]
# Remove directories in ignore
# Directory names must match exactly!
for idir in ignore:
    if idir in dirnames:
        dirnames.remove(idir)
[/python]
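The same pruning can also be written with slice assignment, which replaces the contents of the list in one step. This is an alternative sketch rather than the form used in this article, and walk_pruned is a name invented for the example. Either way the key point is that the list object os.walk handed us is modified, not rebound, and that os.walk runs top-down (its default).

```python
import os

def walk_pruned(topdir, ignore):
    """Yield (dirpath, files) from os.walk, skipping directories named in ignore."""
    for dirpath, dirnames, files in os.walk(topdir):
        # Slice assignment changes the existing list in place, so os.walk
        # sees the pruned list; 'dirnames = [...]' would only rebind the name
        dirnames[:] = [d for d in dirnames if d not in ignore]
        yield dirpath, files

if __name__ == '__main__':
    for dirpath, files in walk_pruned('.', ['docs', 'doc']):
        print('%s: %d file(s)' % (dirpath, len(files)))
```

Pruning only has an effect in top-down walks; with topdown=False the subdirectories have already been visited by the time their parent is reported.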
The removal loop edits dirnames in-place, so the walk will not descend into the folders named in ignore. The full script with the new walk loop now looks like this:
[python]
import os
# The top argument for walk
topdir = '.'
extens = ['txt', 'pdf', 'doc']  # the extensions to search for
found = {x: [] for x in extens}  # lists of found files
# Directories to ignore
ignore = ['docs', 'doc']
logname = "findfiletypes.log"
print('Beginning search for files in %s' % os.path.realpath(topdir))
# Walk the tree
for dirpath, dirnames, files in os.walk(topdir):
    # Remove directories in ignore
    # Directory names must match exactly!
    for idir in ignore:
        if idir in dirnames:
            dirnames.remove(idir)
    # Loop through the file names for the current step
    for name in files:
        # Split the name by '.' & get the last element
        ext = name.lower().rsplit('.', 1)[-1]
        # Save the full name if ext matches
        if ext in extens:
            found[ext].append(os.path.join(dirpath, name))
# The header in our logfile
loghead = 'Search log from filefind for files in {}\n\n'.format(
    os.path.realpath(topdir)
)
# The body of our log file
logbody = ''
# Loop through results
for search in found:
    # Concatenate the result from the found dict
    logbody += "<< Results with the extension '%s' >>" % search
    logbody += '\n\n%s\n\n' % '\n'.join(found[search])
# Write results to the logfile
with open(logname, 'w') as logfile:
    logfile.write('%s\n%s' % (loghead, logbody))
[/python]
With our new ignore list in place, the log file turns out looking like this (shortened):
[shell]
Search log from filefind for files in C:\Python27\Lib\site-packages
<< Results with the extension 'pdf' >>
.\GPL_Full.pdf
<< Results with the extension 'txt' >>
.\README.txt
.\soup.txt
.\beautifulsoup4-4.1.3\AUTHORS.txt
.\beautifulsoup4-4.1.3\COPYING.txt
...
.\beautifulsoup4-4.1.3\scripts\demonstration_markup.txt
.\wx-2.8-msw-unicode\wx\lib\editor\README.txt
...
.\wx-2.8-msw-unicode\wx\tools\XRCed\TODO.txt
<< Results with the extension 'doc' >>
[/shell]
Our ignore list worked just as we wanted it to, cutting out the full tree under the "docs" directory in wx-...-unicode. We can also see that the other ignored directory ("doc") cut the other two PDF files from our PDF results, and for both directories we didn't need to name the full path (the entries in dirnames are bare directory names, not full paths). This is convenient, but always remember that this method prunes every part of the tree whose directory name matches one in the ignore list. To avoid this, try using dirpath and dirnames together to specify full paths to ignore, if you don't mind going through the trouble of naming the full path!
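As a rough sketch of that full-path idea, the function below joins dirpath with each entry of dirnames and compares the result against a set of paths to skip. The names walk_skipping_paths and ignore_paths are hypothetical, and the sketch assumes the ignored paths are written relative to the same base as topdir (both relative, or both absolute).

```python
import os

def walk_skipping_paths(topdir, ignore_paths):
    """Walk topdir without descending into the exact paths in ignore_paths."""
    # Normalize once so that './a/docs' and 'a/docs' compare equal
    skip = set(os.path.normpath(p) for p in ignore_paths)
    for dirpath, dirnames, files in os.walk(topdir):
        # Keep a directory unless its full path was listed explicitly
        dirnames[:] = [
            d for d in dirnames
            if os.path.normpath(os.path.join(dirpath, d)) not in skip
        ]
        yield dirpath, dirnames, files
```

With this variant, ignoring one "docs" folder leaves any other folder that happens to be named "docs" in the search, at the price of spelling out each path.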
Now that we've completed this version of our file/directory manipulation script, we can quickly search for multiple file extensions under any tree and have a record of all those found with just a double-click. This is great if we simply want to know where all the files are. But they likely won't all be in the same folder, so if we wanted to move or copy them all to one folder, or do something else with all of them at once, working through the log file line by line would be tedious. This is why in the next part we'll look at how to upgrade our script to move, copy/back up, or alternatively erase all the files we are looking for.