Remember that a hash is a function that takes a variable-length sequence of bytes and converts it to a fixed-length sequence. Calculating a file's hash is useful when you need to check whether two files are identical, to make sure a file's contents have not changed, or to check a file's integrity after it has been transmitted over a network. Websites often publish the MD5 or SHA checksum of a download so that you can verify the file arrived intact.
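As a quick illustration of the fixed-length property, the sketch below hashes byte strings of very different sizes. The sample inputs are made up for the example.

```python
import hashlib

# Hypothetical inputs of very different lengths.
samples = [b"", b"a", b"hello world" * 1000]

digests = [hashlib.md5(data).hexdigest() for data in samples]
for data, digest in zip(samples, digests):
    # An MD5 digest is 128 bits, so its hex form is always 32 characters.
    print(len(data), len(digest), digest)
```

Whatever the input size, the hexadecimal digest has the same length.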
Hashing Algorithms
The most commonly used algorithms for hashing a file are MD5 and SHA-1. They are popular because they are fast and provide a good way to tell files apart. The hash function uses only the contents of the file, not its name. If two separate files produce the same hash, there is a high probability that their contents are identical, even if their names differ. Note that both algorithms have known collision attacks, so prefer a SHA-2 function such as SHA-256 when the check must resist deliberate tampering.
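To see that only the contents matter, here is a minimal sketch that writes three small files into a temporary directory and compares their digests. The file names and the `md5_of` helper are hypothetical, chosen just for this illustration.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Return the MD5 hex digest of the file at path (reads it in one go)."""
    with open(path, 'rb') as afile:
        return hashlib.md5(afile.read()).hexdigest()


# Hypothetical file names, created in a throwaway directory.
workdir = tempfile.mkdtemp()
first = os.path.join(workdir, 'report.txt')
second = os.path.join(workdir, 'copy_of_report.bak')
third = os.path.join(workdir, 'changed.txt')

for path, payload in [(first, b'same contents'),
                      (second, b'same contents'),
                      (third, b'same contents!')]:
    with open(path, 'wb') as out:
        out.write(payload)

# Same bytes under different names give the same digest.
same_despite_names = md5_of(first) == md5_of(second)
# Changing a single byte gives a different digest.
differs_on_change = md5_of(first) != md5_of(third)
print(same_despite_names, differs_on_change)
```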
MD5 File Hash in Python
The code is made to work with Python 2.7 and higher (including Python 3.x).
[python]
import hashlib
hasher = hashlib.md5()
with open('myfile.jpg', 'rb') as afile:
    buf = afile.read()
    hasher.update(buf)
print(hasher.hexdigest())
[/python]
The code above calculates the MD5 digest of the file. The file is opened in rb mode, which means it is read in binary mode. This is because the MD5 function needs to read the file as a sequence of bytes, which makes sure that you can hash any type of file, not only text files.
It is important to notice the read function. When it is called with no arguments, as in this case, it reads the entire contents of the file and loads them into memory. This is dangerous if you are not sure of the file's size. A better version reads the file in fixed-size blocks:
MD5 Hash for Large Files in Python
[python]
import hashlib
BLOCKSIZE = 65536
hasher = hashlib.md5()
with open('anotherfile.txt', 'rb') as afile:
    buf = afile.read(BLOCKSIZE)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(BLOCKSIZE)
print(hasher.hexdigest())
[/python]
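Feeding the hasher block by block gives the same digest as hashing the whole file at once. The sketch below checks this on a throwaway file of random bytes. The temporary file and its size are assumptions made for the example, and the loop uses `iter()` with a sentinel as an alternative spelling of the `while` loop above.

```python
import hashlib
import os
import tempfile

BLOCKSIZE = 65536

# Create a throwaway file slightly larger than three blocks.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as out:
    out.write(os.urandom(3 * BLOCKSIZE + 123))

# Hash the whole file in a single call.
with open(path, 'rb') as afile:
    whole = hashlib.md5(afile.read()).hexdigest()

# Hash the same file one block at a time; read() returns b'' at end of file.
hasher = hashlib.md5()
with open(path, 'rb') as afile:
    for buf in iter(lambda: afile.read(BLOCKSIZE), b''):
        hasher.update(buf)
chunked = hasher.hexdigest()

os.remove(path)
print(whole == chunked)
```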
If you need to use another algorithm, just change the md5 call to another supported function, e.g. SHA1:
SHA1 File Hash in Python
[python]
import hashlib
BLOCKSIZE = 65536
hasher = hashlib.sha1()
with open('anotherfile.txt', 'rb') as afile:
    buf = afile.read(BLOCKSIZE)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(BLOCKSIZE)
print(hasher.hexdigest())
[/python]
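Since the MD5 and SHA1 versions differ only in the constructor, one possible way to avoid the duplication is a small helper that takes the algorithm name and passes it to `hashlib.new()`. The function name `hash_file` and its default arguments are assumptions made for this sketch, not part of hashlib.

```python
import hashlib


def hash_file(path, algorithm='sha1', blocksize=65536):
    """Return the hex digest of the file at path using the named algorithm."""
    # hashlib.new() raises ValueError if the name is not supported.
    hasher = hashlib.new(algorithm)
    with open(path, 'rb') as afile:
        buf = afile.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(blocksize)
    return hasher.hexdigest()
```

With this helper, a call such as `hash_file('anotherfile.txt', 'sha256')` would switch algorithms without touching the reading loop.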
If you need a list of the hash algorithms supported on your system, use hashlib.algorithms_available (added in Python 3.2 and also backported to Python 2.7.9). Finally, for another look into hashing, be sure to check out the hashing Python strings article.
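As a short sketch of the algorithm listing mentioned above, the snippet below prints the names available on the current interpreter. The exact output depends on your Python build and the OpenSSL library it links against, so none is shown here.

```python
import hashlib

# Names that hashlib.new() accepts on this interpreter; the exact set
# depends on the Python build and the linked OpenSSL library.
available = sorted(hashlib.algorithms_available)
print(available)

# algorithms_guaranteed is the subset that hashlib provides on all platforms.
print('sha256' in hashlib.algorithms_guaranteed)
```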