When you use a scripting language like Python, one thing you will find yourself doing over and over again is walking a directory tree, and processing files. While there are many ways to do this, Python offers a built-in function that makes this process a breeze.
What is the os.walk() Function?
The os.walk() function is part of Python's os module and works the same way on any operating system. Python users can utilize the function to generate the file names in a directory tree. The function can navigate the tree in either direction, top-down or bottom-up.
For every directory in the tree rooted at the starting directory (including the starting directory itself), the os.walk() function generates a 3-tuple: the directory path, the names of its subdirectories, and the names of its files.
The items in each tuple are:
- dirpath: A string holding the path to the directory currently being visited.
- dirnames: A list of the names of the subdirectories in dirpath, excluding “.” and “..”.
- filenames: A list of the names of the non-directory files in dirpath, whether system- or user-created.
It’s important to note that the names in these lists do not contain any component of the path. If a user wants the full path to a file or directory in dirpath (a path that begins at the top directory), they must use os.path.join(), passing it dirpath and the name.
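As a minimal sketch of the point about full paths: the helper below (iter_full_paths is a hypothetical name, not part of the os module) joins dirpath with each bare name that os.walk() reports.

```python
import os


def iter_full_paths(top):
    """Yield the full path of every file and subdirectory under ``top``."""
    for dirpath, dirnames, filenames in os.walk(top):
        # dirnames and filenames hold bare names only, so each one is
        # joined with dirpath to build a path that starts at ``top``.
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)
```

Calling list(iter_full_paths('.')) would collect the paths of everything below the current directory.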
As mentioned earlier, the os.walk() function can traverse a tree either top-down or bottom-up. The direction is controlled by a single optional argument, topdown, so the user does not have to pass anything to get a sequence of directories.
Top-down traversal is used by default whenever the user does not pass the topdown argument. If topdown is true, the function generates the triple for a directory before the triples for any of its subdirectories.
On the other hand, if topdown is false, the function generates the triple for a directory after the triples for all of its subdirectories. In simple words, the sequence is generated in a bottom-up manner.
Furthermore, when topdown is true, users can modify the dirnames list in place, and the os.walk() function will only descend into the subdirectories whose names remain in it. Modifying dirnames when topdown is false has no effect on the walk, since in bottom-up mode the subdirectories are generated before their parent directory.
By default, errors from the underlying directory-listing call (os.listdir() in older Python versions, os.scandir() since Python 3.5) are ignored.
Working of Python os.walk() Function
A file system is traversed in a specific way in Python. The file system is like a tree with a single root that divides itself into branches, and the branches expand into sub-branches, and so on.
The os.walk() function generates the names of the files in a directory tree by traversing the tree from the bottom or the top.
Syntax of os.walk()
The syntax of the os.walk function is:
os.walk(top, topdown=True, onerror=None, followlinks=False)
Where:
- top: The starting point, or “head”, of the directory traversal. As mentioned earlier, a 3-tuple is generated for every directory in the tree rooted here.
- topdown: When this argument is true, the directories are scanned from top to bottom, and when false, the directories are scanned from bottom to top.
- onerror: An optional function for handling errors. It is called with one argument, an OSError instance, and can either report the error and let the walk continue or raise an exception to abort the walk.
- followlinks: When set to true, os.walk() also descends into directories reached through symbolic links, which it does not do by default. This can result in infinite recursion if a link points to one of its own parent directories, because the os.walk() function never keeps a record of the directories it has previously traversed.
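To make the onerror parameter more concrete, here is a small sketch that passes all four arguments explicitly. The function and handler names (walk_with_errors, remember_error) are made up for this illustration. The handler simply records each error so the walk can continue.

```python
import os


def walk_with_errors(top):
    """Walk ``top`` and return (directories_seen, errors_collected)."""
    errors = []

    def remember_error(err):
        # os.walk() calls this with an OSError instance whenever a
        # directory cannot be listed; recording it lets the walk go on.
        errors.append(err)

    seen = [
        dirpath
        for dirpath, _dirnames, _filenames in os.walk(
            top, topdown=True, onerror=remember_error, followlinks=False
        )
    ]
    return seen, errors
```

Raising the error inside the handler instead of recording it would abort the walk at the first unreadable directory.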
How to Use os.walk()
Since os.walk() works with the file structure of the operating system, users must first import the os module into the Python environment. The module is a part of the standard Python installation, so nothing extra needs to be installed for the rest of the file-listing script.
Next, users must define the file listing function. Users can name it anything, but using a name that makes its purpose clear is best practice. The function must be given two arguments: filepath and filetype.
The filepath argument will indicate to the function where it must start looking for files. It will utilize the file path string in your operating system’s format.
---
Note: Escaping or encoding characters as appropriate is a must.
---
When the file listing function runs, it assumes that this base directory contains all the files and subfolders that the user requires it to check.
On the other hand, the filetype argument will indicate to the function what type of file the user is looking for. The argument accepts the file extension in the string format, for example, “.txt”.
Next, the user needs to store all the relevant file paths the script finds within the file listing function. For this reason, users must create an empty list.
When the function is utilized, it will find every file within the filepath and verify whether the extension matches the required filetype. It will then add the relevant results to the empty list.
Therefore, to begin the iterative process, we must use a for loop over os.walk(). The function will visit every directory under filepath and generate a 3-tuple for each one. Let us assume we name these components root, dirs, and files.
Since the files component will list all names of files within the path, the function must iterate through every file name. For this, we must write a for loop.
Now, under this file-level loop, the file listing function must examine all the aspects of every file. If the application you’re writing has other requirements, this is where you must alter the script. However, for the sake of explanation, we will focus on checking all the files for the required file extension.
In Python, comparing strings is case-sensitive. However, file extensions are written in different cases. Therefore, we must use the lower() method to turn both the file and filetype into lower-case strings. This way, we avoid missing any files due to a mismatch in capitalization.
Next, we must use the endswith() method to compare the end of the lowercase file attribute where the file extension is stored with the lowercase filetype attribute. The method will return True or False depending on whether there is a match or not.
The Boolean result must then be included in an if statement, so the next lines in the script are triggered only if there is a matching file type.
In the event that the file extension matches the requirement, the file's full location must be added to paths, which is our list of relevant file paths.
Using the os.path.join() function will combine the root file path and the file name to make a complete address that the operating system can work with. The resulting path can then be added to the list using the append() method.
Finally, Python will iterate through the loops, going through all the folders and files and building a paths list without hassle. To make this list available outside the file listing function, we must write return(paths) at the end of the function.
Overall, the code should look like this:
import os

def list_files(filepath, filetype):
    paths = []
    for root, dirs, files in os.walk(filepath):
        for file in files:
            if file.lower().endswith(filetype.lower()):
                paths.append(os.path.join(root, file))
    return(paths)
To use the function, call it with a starting folder and a file extension, and store the result in a variable. From there, you can save the resulting locations into a file at a location of your choosing or process them further. The call could look something like this:
my_files_list = list_files('C:\\Users\\Public\\Downloads', '.csv')
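Saving the resulting locations to a file could be sketched as follows. The function name save_paths and the output file name used in the usage note are assumptions made for this example. The sketch writes one path per line.

```python
def save_paths(paths, output_file):
    """Write one path per line to ``output_file`` and return how many were written."""
    with open(output_file, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")
    return len(paths)
```

For instance, save_paths(my_files_list, 'csv_files.txt') would store the list built above in a text file named csv_files.txt in the current working directory.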
Now that your script is ready to find the files you need, you can focus on analyzing text, merging data, or whatever else you need to do.
Basic Python Directory Traversal
Here's a really simple example that walks a directory tree, printing out the name of each directory and the files contained:
[python]
# Import the os module, for the os.walk function
import os

# Set the directory you want to start from
rootDir = '.'
for dirName, subdirList, fileList in os.walk(rootDir):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        print('\t%s' % fname)
[/python]
os.walk takes care of the details, and on every pass of the loop, it gives us three things:
- dirName: The next directory it found.
- subdirList: A list of sub-directories in the current directory.
- fileList: A list of files in the current directory.
Let's say we have a directory tree that looks like this:
+--- test.py
|
+--- [subdir1]
|     |
|     +--- file1a.txt
|     +--- file1b.png
|
+--- [subdir2]
|
+--- file2a.jpeg
+--- file2b.html
The code above will produce the following output:
[shell]
Found directory: .
	file2a.jpeg
	file2b.html
	test.py
Found directory: ./subdir1
	file1a.txt
	file1b.png
Found directory: ./subdir2
[/shell]
Changing the Way the Directory Tree is Traversed
By default, Python will walk the directory tree in a top-down order: a directory is passed to you for processing, and then Python descends into its sub-directories. We can see this behaviour in the output above; the parent directory (.) was printed first, then its 2 sub-directories.
Sometimes we want to traverse the directory tree bottom-up (files at the very bottom of the directory tree are processed first), then work our way up the directories. We can tell os.walk to do this via the topdown parameter:
[python]
import os

rootDir = '.'
for dirName, subdirList, fileList in os.walk(rootDir, topdown=False):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        print('\t%s' % fname)
[/python]
Which gives us this output:
[shell]
Found directory: ./subdir1
	file1a.txt
	file1b.png
Found directory: ./subdir2
Found directory: .
	file2a.jpeg
	file2b.html
	test.py
[/shell]
Now we get the files in the sub-directories first, then we ascend up the directory tree.
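One situation where bottom-up order helps is adding up directory sizes, because every sub-directory has been processed by the time its parent comes around. The following is only a sketch: directory_sizes is a hypothetical helper, and it assumes the tree contains no broken symbolic links (os.path.getsize would raise an error on those).

```python
import os


def directory_sizes(rootDir):
    """Return {directory: total bytes of the files in it and below it}."""
    totals = {}
    for dirName, subdirList, fileList in os.walk(rootDir, topdown=False):
        # Bytes used by the files sitting directly in this directory.
        size = sum(os.path.getsize(os.path.join(dirName, fname))
                   for fname in fileList)
        # Bottom-up order: each sub-directory that was walked already
        # has its total recorded (unwalked ones, such as symlinked
        # directories, count as 0).
        size += sum(totals.get(os.path.join(dirName, sub), 0)
                    for sub in subdirList)
        totals[dirName] = size
    return totals
```

With a top-down walk the same calculation would need a second pass, since a parent would be seen before the totals of its children exist.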
Selectively Recursing Into Sub-Directories
The examples so far have simply walked the entire directory tree, but os.walk allows us to selectively skip parts of the tree.
For each directory os.walk gives us, it also provides a list of sub-directories (in subdirList). If we modify this list, we can control which sub-directories os.walk will descend into. Let's tweak our example above so that we skip the first sub-directory.
[python]
import os

rootDir = '.'
for dirName, subdirList, fileList in os.walk(rootDir):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        print('\t%s' % fname)
    # Remove the first entry in the list of sub-directories
    # if there are any sub-directories present
    if len(subdirList) > 0:
        del subdirList[0]
[/python]
This gives us the following output:
[shell]
Found directory: .
	file2a.jpeg
	file2b.html
	test.py
Found directory: ./subdir2
[/shell]
We can see that the first sub-directory (subdir1) was indeed skipped.
This only works when the directory is being traversed top-down since for a bottom-up traversal, sub-directories are processed before their parent directory, so trying to modify the subdirList would be pointless since by that time, the sub-directories would have already been processed!
It's also important to modify the subdirList in-place, so that the code calling us will see the changes. If we did something like this:
[python] subdirList = subdirList[1:] [/python]
... we would create a new list of sub-directories, one that the calling code wouldn't know about.
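An in-place alternative is slice assignment, which replaces the contents of the existing list rather than rebinding the name. The sketch below skips sub-directories by name. Both walk_skipping and its skip_names parameter are hypothetical names chosen for this example.

```python
import os


def walk_skipping(rootDir, skip_names):
    """Yield (dirName, fileList) pairs without entering directories in skip_names."""
    for dirName, subdirList, fileList in os.walk(rootDir):
        # Slice assignment changes the very list object os.walk holds,
        # so the removed sub-directories are never descended into.
        subdirList[:] = [d for d in subdirList if d not in skip_names]
        yield dirName, fileList
```

For the tree above, walk_skipping('.', {'subdir1'}) would visit . and ./subdir2 only.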
Four Other Ways of Listing a File in a Directory
Besides the os.walk() method, there are four other ways of listing a file in a directory:
#1 Listing All Files of a Directory with listdir() and isfile() functions
Using the listdir() and isfile() functions in tandem makes it easy to get a list of the files in a directory. listdir() lives in the os module and isfile() in its os.path submodule. Here is how you use them:
Step 1: Import the os Module
The os module is a standard Python module that enables users to work with functionality dependent on the operating system. It comprises many methods that enable users to interact with the operating system, including the file system.
Step 2: Use the os.listdir() Function
Calling the os.listdir() function with a path returns a list of the names of the files and directories inside the directory at that path.
Step 3: Iterate the Result
Write a for loop to iterate the files that the function returns.
Step 4: Use the isfile() Function
Every iteration of the loop must call the os.path.isfile(path) function to verify whether the current path is a file or a directory.
If the path is a file, the function returns True, and the file is added to the list. Otherwise, the function returns False.
Here’s an example of the listdir() function listing only the files from a directory:
import os

# setting the folder path
dir_path = r'E:\example'

# making a list to store files
res = []

# Iterating the directory
for path in os.listdir(dir_path):
    # check whether the current path is a file
    if os.path.isfile(os.path.join(dir_path, path)):
        res.append(path)
print(res)
It’s important to note that the listdir() function only lists the entries directly inside the given directory; it does not look into subdirectories.
If you’re familiar with generator functions, you can restructure the script so that it yields the file names one at a time, like so:
import os

def get_files(path):
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            yield file

# Now, you can simply call it wherever you want
for file in get_files(r'E:\example'):
    print(file)
The listdir() function can also be used to list both files and directories. Here’s an example:
import os

# folder path
dir_path = r'E:\account'

# list files and directories
# Directly call the listdir() function to get the content of the directory.
res = os.listdir(dir_path)
print(res)
#2 Using the os.scandir() Function to Get Files of a Directory
The scandir() function is known for being faster than the os.listdir() function when file type information is needed, and it also iterates directories more efficiently. It is a directory iteration function similar to listdir(); the only difference is that it yields DirEntry objects that include the file type data and the name instead of returning a list of plain filenames.
Since Python 3.5, os.walk() uses scandir() internally. According to PEP 471, which introduced the function, this increases the speed of os.walk() between two and 20 times, depending on the operating system and file system configuration. It provides this speed boost by avoiding unnecessary calls to the os.stat() function.
It’s important to note that the scandir() function returns an iterator of os.DirEntry objects, containing file names.
The scandir() function was included in the standard Python library in Python 3.5 back in September 2015.
Here’s an example of using the function to retrieve the files of a directory:
import os

# retrieve all the files from inside the specified folder
dir_path = r'E:\example'
for path in os.scandir(dir_path):
    if path.is_file():
        print(path.name)
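Because each DirEntry carries more than the name, the same loop can collect other details as well. The sketch below uses a hypothetical helper name, files_with_sizes. It closes the iterator with a with-block (supported since Python 3.6) and reads each file's size through the entry's stat() method.

```python
import os


def files_with_sizes(dir_path):
    """Return a sorted list of (name, size_in_bytes) for the files in dir_path."""
    result = []
    # The with-block closes the scandir iterator as soon as we are done.
    with os.scandir(dir_path) as entries:
        for entry in entries:
            if entry.is_file():
                result.append((entry.name, entry.stat().st_size))
    return sorted(result)
```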
#3 Using the Glob Module
The glob module is also a part of the standard Python library. Users can utilize the module to find the files and folders whose names follow a specified pattern.
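For instance, if you want to get the files of a directory, you can use the dir_path/*.* pattern, where the “*.*” matches any name that contains a dot, which in practice means files with any extension. Names without a dot, including most folder names, are not matched.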
Here’s an example of using the module to retrieve a list of files in a directory:
import glob

# search all the files inside a directory
# The *.* indicates that the file name may have any extension
dir_path = r'E:\example\*.*'
res = glob.glob(dir_path)
print(res)
You can also use the module to list files from the subdirectories by putting the “**” pattern in the path and setting the recursive argument to True:
import glob

# search all the files inside a directory and its subdirectories
# The ** matches any number of nested folders
# The *.* indicates that the file name may have any extension
dir_path = r'E:\demos\files_demos\example\**\*.*'
for file in glob.glob(dir_path, recursive=True):
    print(file)
#4 Using the Pathlib Module
The pathlib module was introduced to the Python standard library in Python 3.4. It offers a wrapper for most operating system functions. It includes classes and methods that enable users to handle filesystem paths and retrieve data related to files for various operating systems.
Here’s how you use the module to retrieve the list of files in a directory:
- Import the pathlib module.
- Write a pathlib.Path(‘path’) line to construct the directory path.
- Use the iterdir() function to iterate all the entries in a directory.
- Finally, check whether the current entry is a file using the path.is_file() method.
Here’s an example script utilizing the pathlib module for this purpose:
import pathlib

# Declaring the folder path
dir_path = r'E:\example'

# making a list to store file names
res = []

# constructing the path object
d = pathlib.Path(dir_path)

# iterating the directory
for entry in d.iterdir():
    # check if it is a file
    if entry.is_file():
        res.append(entry)
print(res)
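Like listdir(), iterdir() only looks at one directory. For a whole tree, pathlib offers Path.rglob(), which applies a pattern recursively. The sketch below wraps it in a hypothetical helper, list_files_recursive, whose default pattern "*" matches every name.

```python
import pathlib


def list_files_recursive(dir_path, pattern="*"):
    """Return a sorted list of Path objects for files under dir_path matching pattern."""
    d = pathlib.Path(dir_path)
    # rglob() walks the whole tree; is_file() filters out the directories.
    return sorted(entry for entry in d.rglob(pattern) if entry.is_file())
```

Passing a pattern such as "*.txt" narrows the result to one extension.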
When is the Right Time to Use the os.listdir() Function Instead of the os.walk() Function?
This is one of the most common questions programmers ask after they learn about os.walk(). The answer is straightforward:
The os.walk() function will generate the files and folders of every directory in a file tree, one directory at a time, while the os.listdir() function will return a list of the files and folders directly inside a single directory.
Understanding the working of the os.listdir() function can get clearer with an example:
Let us assume that we have a directory “Example,” with three folders A, B, and C, and two text files 1.txt and 2.txt. The A folder has another folder, Y, which contains two files. Both B and C folders have one text file each.
Passing the directory path of the “Example” folder to the os.listdir() method:
import os

example_directory_path = './Example'
print(os.listdir(example_directory_path))
The above script will give you an output like this (the order of the names is not guaranteed):
['1.txt', '2.txt', 'A', 'B', 'C']
In other words, the os.listdir() function will only generate the first “layer” of the directory. This makes it very different from the os.walk() method, which searches through the entire directory tree.
Simply put, if you need a list of all the file and directory names in a root directory, you can use the os.listdir() function. However, if you want to have a look at the entire directory tree, using the os.walk() method is the right way to go.
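The difference can be checked with a short, self-contained sketch that rebuilds the “Example” directory inside a temporary folder. The names of the files inside Y, B, and C are invented for the illustration, as are the helper names build_example and count_files_walk.

```python
import os
import tempfile


def build_example(base):
    """Create the 'Example' tree described above under ``base``; return its path."""
    example = os.path.join(base, "Example")
    for folder in ("A/Y", "B", "C"):
        os.makedirs(os.path.join(example, *folder.split("/")))
    # File names below the first layer are hypothetical.
    for rel in ("1.txt", "2.txt", "A/Y/y1.txt", "A/Y/y2.txt", "B/b.txt", "C/c.txt"):
        open(os.path.join(example, *rel.split("/")), "w").close()
    return example


def count_files_walk(top):
    """Count every file in the tree rooted at ``top`` using os.walk()."""
    return sum(len(files) for _root, _dirs, files in os.walk(top))


with tempfile.TemporaryDirectory() as base:
    example = build_example(base)
    print(sorted(os.listdir(example)))  # ['1.txt', '2.txt', 'A', 'B', 'C']
    print(count_files_walk(example))    # 6
```

os.listdir() sees only the five first-layer entries, while os.walk() reaches all six files, including the two inside A/Y.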
Conclusion
Using the os.walk() function in Python is one of the handy ways to traverse all the paths in a directory tree in either a top-to-bottom or bottom-to-top manner. We have also covered the four other ways of listing a file in a directory in this post.
Now that you know about the different approaches, you can write a script to traverse the directory without much hassle and focus your efforts on analyzing text or merging data.
For a more comprehensive tutorial on Python's os.walk method, check out the recipe Recursive File and Directory Manipulation in Python. Or to take a look at traversing directories in another way (using recursion), check out the recipe Recursive Directory Traversal in Python: Make a list of your movies!.