Overview
Teaching: 0 min
Exercises: 0 min
Questions
How can I use Python to generate checksums?
How can I use Python to validate bags?
Objectives
Use hashlib to generate checksums
Use bagit-python to validate bags
People keep asking Alice about her fixity practices. She knows that the files from the 3 projects have different fixity information, and she wants to see what she can do about this with Python.
All of the sub-folders of 120th_anniversary are bags.
If you’ve ever used the Bagger application, you might dread the idea of opening each bag, clicking the validate button, waiting for however long, seeing ‘Valid!’, and then repeating this loop.
But, there is a bagit module for Python.
Let’s try it out.
First we install the module.
pip install bagit
Then we import the module.
import bagit
The bagit module works as follows.
A folder can be loaded as a bag by sending its path to bagit.Bag().
Once loaded, the bag object has a function to validate itself.
import os  # needed for os.path.join

bag = bagit.Bag(os.path.join('Desktop', 'pyforav', '120th_anniversary', '1143'))
bag.validate()
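The result of validate() will be True if the bag is valid. If the bag is not valid, validate() raises a bagit.BagValidationError instead of returning a value.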
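To get a feel for what validation involves, here is a minimal sketch of one part of the job: comparing the files in a bag against its MD5 manifest. This is not how bagit-python is implemented, and it skips most of what validate() covers. The function names md5_of_file and check_manifest are made up for this illustration, and the sketch assumes a manifest-md5.txt file whose lines look like a checksum followed by a relative path.

```python
import hashlib
import os


def md5_of_file(path, blocksize=65536):
    """Return the MD5 hex digest of a file, reading it one block at a time."""
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        buffer = f.read(blocksize)
        while buffer:
            hasher.update(buffer)
            buffer = f.read(blocksize)
    return hasher.hexdigest()


def check_manifest(bag_path, manifest_name='manifest-md5.txt'):
    """Return a list of (relative_path, problem) pairs; empty if all is well.

    Assumption: each manifest line is '<checksum> <relative/path>'.
    """
    problems = []
    manifest_path = os.path.join(bag_path, manifest_name)
    with open(manifest_path, encoding='utf-8') as manifest:
        for line in manifest:
            if not line.strip():
                continue  # skip blank lines
            expected, rel_path = line.strip().split(maxsplit=1)
            # Manifest paths use forward slashes, whatever the operating system
            file_path = os.path.join(bag_path, *rel_path.split('/'))
            if not os.path.isfile(file_path):
                problems.append((rel_path, 'missing'))
            elif md5_of_file(file_path) != expected.lower():
                problems.append((rel_path, 'checksum mismatch'))
    return problems
```

An empty list means every file named in the manifest exists and matches its recorded checksum. For real work, the bagit module's own validate() is the better choice.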
Automate bag checking
Knowing how the bagit module works, create a for loop that checks all of the bags in the 120th_anniversary folder.
First you will need to create a list of all the subfolders of 120th_anniversary.
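import glob
import os

bag_paths = glob.glob(os.path.join('Desktop', 'pyforav', '120th_anniversary', '*'))
for bag_path in bag_paths:
    bag = bagit.Bag(bag_path)
    # validate() raises bagit.BagValidationError on an invalid bag, which stops the loop
    result = bag.validate()
    print(bag_path, result)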
For an extra challenge, how would you report these results as a CSV?
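One possible approach to the CSV challenge, sketched with Python's built-in csv module. The function name write_report, the column names, and the example rows are all made up for illustration. In practice the rows would be collected inside the for loop.

```python
import csv


def write_report(results, csv_path):
    """Write (bag_path, result) pairs to a CSV file with a header row."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['bag_path', 'valid'])
        for bag_path, result in results:
            writer.writerow([bag_path, result])


# Made-up example rows, standing in for results gathered in the loop
example_results = [
    ('120th_anniversary/bag_a', True),
    ('120th_anniversary/bag_b', False),
]
```

To gather the rows, start with an empty list before the loop and append (bag_path, result) on each pass. Because validate() raises an error on an invalid bag, the bag object's is_valid() method, which returns True or False, may be more convenient for building a report.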
Files without checksums
There are no checksums to check for the files in on_demand.
Instead, checksums can be created.
One way to do this would be to turn the folder into a bag.
The bagit module has a make_bag() function.
Creating a bag
How do you expect the
bagit.make_bag() requires a path as an argument. Once run, make_bag() creates a bag in place, moving the contents of a folder into a data subdirectory, and writing manifests. Warning: because the following code will move all the files in on_demand, you might want to copy this folder, so you can delete the bag you create.
bagit.make_bag(os.path.join('Desktop', 'pyforav', 'on_demand'))
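If you would rather make that copy from Python than from your file browser, here is a small sketch using the standard shutil module. The function name make_working_copy and the _copy suffix are made up for this example.

```python
import os
import shutil


def make_working_copy(folder, suffix='_copy'):
    """Copy a folder next to the original and return the path of the copy."""
    copy_path = folder.rstrip(os.sep) + suffix
    # copytree raises FileExistsError if the copy already exists
    shutil.copytree(folder, copy_path)
    return copy_path


# Example use (not run here):
# copy_path = make_working_copy(os.path.join('Desktop', 'pyforav', 'on_demand'))
# bagit.make_bag(copy_path)
```

Bagging the copy leaves the original on_demand folder untouched, and the copy can be deleted when you are done experimenting.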
Another approach would be to create checksum sidecar files like those in the
For that, we’ll need to use the hashlib module.
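import glob
import hashlib
import os

BLOCKSIZE = 65536  # read each file 64 KiB at a time instead of all at once

ondemand_paths = glob.glob(os.path.join('Desktop', 'pyforav', 'on_demand', '*mov'))
for ondemand_path in ondemand_paths:
    hasher = hashlib.md5()  # start a fresh hasher for every file
    with open(ondemand_path, 'rb') as f:
        buffer = f.read(BLOCKSIZE)
        while len(buffer) > 0:
            hasher.update(buffer)
            buffer = f.read(BLOCKSIZE)
    # swap only the file extension, in case 'mov' appears elsewhere in the path
    sidecar_path = os.path.splitext(ondemand_path)[0] + '.md5'
    with open(sidecar_path, 'w') as f:
        f.write(hasher.hexdigest())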
It can be a good exercise to read and understand what is happening in this code. It’s also a fair reaction to look at this code, look at another tool for creating checksum sidecar files, and choose the other tool. For example:
md5 Desktop/pyforav/on_demand/middlerrequest_1669.mov > Desktop/pyforav/on_demand/middlerrequest_1669.md5
This command creates nearly the same result without having to think about buffers or reading files.
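If you do stay in Python, a sidecar file also gives you something to check against later. The sketch below wraps the block-reading pattern in a function and compares a file's current checksum with its sidecar. The function names file_md5 and sidecar_matches are made up, and the sketch assumes the sidecar contains nothing but the hex digest.

```python
import hashlib
import os


def file_md5(path, blocksize=65536):
    """Return the MD5 hex digest of a file, reading it one block at a time."""
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        buffer = f.read(blocksize)
        while buffer:
            hasher.update(buffer)
            buffer = f.read(blocksize)
    return hasher.hexdigest()


def sidecar_matches(media_path):
    """Return True if the file's current MD5 equals the one in its .md5 sidecar.

    Assumption: the sidecar sits next to the file, has the same base name,
    and holds only the hex digest.
    """
    sidecar_path = os.path.splitext(media_path)[0] + '.md5'
    with open(sidecar_path) as f:
        stored = f.read().strip().lower()
    return file_md5(media_path) == stored
```

Sidecar files written by other tools may include the file name or other text alongside the digest, so the reading step would need adjusting for them.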
When to Python
One of the skills you will build as you use Python is figuring out when you don’t want to use Python.
Maybe you have another tool that already does the job well enough. Maybe the code examples you’re finding look like they will take longer to understand than the time you’ll save using them.
Python is a language. The more you use it, the more you will change how you engage with it.
Just because you can do something in Python doesn’t mean it’s the most convenient way to do it.