This lesson is being piloted (Beta version)

Moving and Transcoding Files

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can I use Python to move files from one place to another?

  • How can I use Python to run ffmpeg?

Objectives
  • Use shutil to copy and move files

  • Use subprocess to call ffmpeg

Catch-up Code

If you get lost or fall behind or need to burn it all down and start again, here’s some quick catch-up code that you can run to get back up to speed.

Remember: press Shift+Return to execute the contents of the cell.

video_dir = '/Users/username/Desktop/pyforav'

or

video_dir = 'C:\\Users\\username\\Desktop\\pyforav'

MAKE SURE TO CHANGE USERNAME TO YOUR USERNAME

Then copy and paste the following:

import os
import subprocess
import glob
mov_list = []
mov_list = glob.glob(os.path.join(video_dir, "**", "*mov"), recursive=True)

And run this to confirm that you’re generating a file list properly (you should see a list of file names; if not, call for help!)

mov_list

Remember Alice? She had 3 tasks that she needed to complete.

  1. create service copy MP4s for all of the master files on her hard drive
  2. share the duration of each video file with a cataloger
  3. preserve the files

Alice will need to create a transcoding script to take care of the first step. You may have noticed that the “120th Anniversary” project included service files. So let’s start there.

Moving Files

When we review the directories in which those service files reside, it appears that they’re all located in BagIt bags. Because bags use multiple mechanisms to track the fixity of their contents, we don’t want to change their contents if we can avoid it. In this case, we can make copies of those existing service files.

For any kind of local file moving or copying in Python, the shutil module is a good starting point. Let’s import that.

import shutil

We’ll be using the shutil.copy() function to do this.

help(shutil.copy)

shutil.copy() takes two arguments, src (path to the source file) and dst (path to the destination). We can pull the source path from our mov_list, but we’ll have to create the destination path ourselves.

Let’s make a single subfolder within the pyforav directory to hold copies of those service files.

service_folder = os.path.join(video_dir, 'service')
service_folder
'/User/benjaminturkus/Desktop/pyforav/service'

This folder doesn’t actually exist yet. It’s just a string. If we try to copy anything to this non-existent folder, nothing good will happen.

shutil.copy(mov_list[0], service_folder)

Because there wasn’t a landing directory ready for our video files, we ended up with an extensionless file called “service.” We wanted to copy the file into a folder called “service”, except the “service” folder didn’t exist.
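To see this behavior without touching the workshop files, here is a minimal sketch that repeats the mistake inside a temporary directory. The file name example.mov and its contents are made up for illustration.

```python
import os
import shutil
import tempfile

# Work in a throwaway directory so nothing on the Desktop is touched.
workdir = tempfile.mkdtemp()

# A stand-in for one of the video files (hypothetical name and contents).
source = os.path.join(workdir, 'example.mov')
with open(source, 'wb') as f:
    f.write(b'not really a video')

# This destination does not exist yet, so shutil.copy() treats it as a filename.
missing_folder = os.path.join(workdir, 'service')
result = shutil.copy(source, missing_folder)

print(result)                          # the path that was written
print(os.path.isdir(missing_folder))   # is it a folder?
print(os.path.isfile(missing_folder))  # or a file?
```

shutil.copy() returns the path it wrote to, so checking that return value is a quick way to notice when a copy landed somewhere unexpected.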

So let’s delete that “service” file and do things the right way.

We’ll use the os module to make this folder. But before doing that, it’s good practice to make sure the folder doesn’t already exist. Among Python users, this is what’s called Looking Before You Leap (LBYL).

if not os.path.exists(service_folder):
	os.makedirs(service_folder)
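As an aside, os.makedirs() also accepts an exist_ok argument that folds the check into the call. The sketch below uses a temporary directory and a made-up variable name (demo_folder) so that it can be run safely; it is an alternative to the if statement, not a replacement for understanding it.

```python
import os
import tempfile

# A throwaway parent directory for the demonstration.
base = tempfile.mkdtemp()
demo_folder = os.path.join(base, 'service')

# With exist_ok=True, the call succeeds whether or not the folder is there.
os.makedirs(demo_folder, exist_ok=True)
os.makedirs(demo_folder, exist_ok=True)  # second call does not raise an error

print(os.path.isdir(demo_folder))
```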

And now that the destination directory actually exists, we should be able to copy files there.

shutil.copy(mov_list[0], service_folder)

So now we have the tool to do a drag-and-drop with Python. But just using this tool would mean writing a command for every file we want to copy. That would have the same problems as dragging-and-dropping every file. To turn this tool into something really useful, we need another building block, the for loop.

for loops

Often when we work with lists, we use for loops, a programming construct for repeating sections of code. During a for loop, Python takes items from a list one at a time and performs the same actions on each item.

for loops will be the most important tool that we’ll use in this workshop.

Dealing with hundreds, thousands, or even more files means a lot of batch processing. The for loop is the most common way that we’ll be doing that batch processing.

Python Syntax: for Loops

In the case of Python, there are a few important things to keep in mind about for loops. The first line of a for loop always looks similar to this: for listitem in somelist:

  • for - tells Python it will have to repeat code on multiple items from a list
  • somelist - the list of items to be worked through
  • listitem - the generic name used to refer to each item in the code section; you choose this name

The body of the for loop is always indented from the first line and can include multiple lines of code, even additional for loops.

for filepath in mov_list:
    print(filepath)

This command outputs each item from our mov_list. We can perform more complicated actions within the loop.

for filepath in mov_list:
    print(filepath + ' exists')

Printing, for loops, and Jupyter

What happens when you don’t include the print function? Why do you think this is the case?

for filepath in mov_list:
  filepath + ' exists'

Solution

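Jupyter displays nothing. A notebook cell automatically shows only the value of the last expression at the top level of the cell; the strings built inside the for loop are created and then discarded. To see a result for every item, we need print(). Because we’re learning about how Python works, during the workshop we will use the print() function a lot.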

Now we have all of the tools to automate the copying. Putting those tools together, we can write code like this.

mp4_list = glob.glob(os.path.join(video_dir, '**', '*mp4'), recursive=True)
for item in mp4_list:
	shutil.copy(item, service_folder)

And we can check that this code did what we intended either by viewing the folder or by using some more Python code.

os.listdir(service_folder)

That’s a big chunk of Alice’s task performed with a minimal amount of work. For the next part, she’ll need to do some transcoding.

Building for Loops

The process we went through above is a great practice when writing a for loop.

  1. Write the code to run on a single piece of data.
  2. Test out the code and make sure it works. If not, keep developing it.
  3. Add the for loop syntax and adjust any variable names.

That way, if there are any problems in the code you write, you’re more likely to catch them before the for loop makes the same mistake over and over again. Imagine if you were copying thousands of files to multiple locations and had to find and manually delete them because of a typo.
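The three steps can be sketched as follows. The paths in sample_paths are hypothetical stand-ins for entries in mov_list, and extracting the filename is just a simple action to practice on.

```python
import os

# Hypothetical paths standing in for entries in mov_list.
sample_paths = ['/tmp/pyforav/a/clip_one.mov', '/tmp/pyforav/b/clip_two.mov']

# Steps 1 and 2: write the code for a single item and check the result.
first_name = os.path.basename(sample_paths[0])
print(first_name)

# Step 3: wrap the same code in a for loop, swapping the fixed item
# (sample_paths[0]) for the loop variable (path).
names = []
for path in sample_paths:
    names.append(os.path.basename(path))
print(names)
```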

Using FFmpeg within Python

Python can be used to run other scripts and programs on a computer. For example, if there is a command-line utility that performs an essential function, you can incorporate it into a Python script that automates that action for a group of files. For this lesson, we will use the subprocess module to run terminal commands from our script. The subprocess module works for all command-line tools. Later, we will look at another method that exists for some, but not all, command-line tools.

import subprocess

We’ll use the subprocess.run() function.

help(subprocess.run)

We can try out the example code to see how this works.

subprocess.run(['ls', '-l'])

What do you expect this code to do?

Before you run this code, predict what the results will be. Did the results match your expectations?

Solution

subprocess.run() runs a terminal command and returns a CompletedProcess object. Its returncode tells us whether the command succeeded: 0 means the command finished successfully, and any non-zero value means the command reported an error.

For some commands, like ffmpeg, this is all we’re interested in. But if we want to see the output of the command, we can do the following.

subprocess.run(['ls', '-l'], capture_output=True)

If you’ve used command-line tools, you might recognize ls -l as the way to list the contents of a directory in long format. subprocess.run() structures that command differently. Instead of taking a single string like 'ls -l', it takes a list where each item in the list is a string. We’ll be taking advantage of that.
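Here is a small sketch of what subprocess.run() hands back. Because ls is not available in the Windows command prompt, this example runs the Python interpreter itself (via sys.executable) as the command; that substitution, and the use of text=True to get output as a string rather than bytes, are choices made for this illustration.

```python
import subprocess
import sys

# Run a tiny Python command in a separate process and capture what it prints.
result = subprocess.run(
    [sys.executable, '-c', 'print("hello from a subprocess")'],
    capture_output=True,  # collect stdout and stderr instead of displaying them
    text=True,            # decode the captured output to str
)

print(result.returncode)  # 0 if the command finished successfully
print(result.stdout)      # whatever the command printed
```

The same attributes, returncode and stdout, are available on the object returned by any subprocess.run() call, including the ffmpeg calls later in this lesson.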

First, let’s assemble an FFmpeg command to make mp4/h264 files. And since we’re mortals, we’ll use the wonderful ffmprovisr.

The suggested command is ffmpeg -i input_file -c:v libx264 -pix_fmt yuv420p -c:a aac output_file. The subprocess.run() function needs the command as a list. We can build it by treating the spaces in the command as the delimiters between list items. When we use this function in a for loop, most of these items will stay the same each time the command runs, but the input and output files will change. For those, we will use variables instead of strings.

['ffmpeg', '-i', input_file, '-c:v', 'libx264', '-pix_fmt', 'yuv420p', '-c:a', 'aac', output_file]
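If splitting a long command by hand feels error-prone, the standard library’s shlex.split() can do the splitting. This is only a sketch for checking your work: here input_file and output_file are left as literal placeholder strings, whereas in the real command they are variables.

```python
import shlex

# The ffmprovisr command as a single string, placeholders included.
command_string = 'ffmpeg -i input_file -c:v libx264 -pix_fmt yuv420p -c:a aac output_file'

# shlex.split() breaks the string on spaces using shell-style rules.
command_list = shlex.split(command_string)
print(command_list)
```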

The input_file and output_file are like the source and destination paths we needed for shutil.copy(). We can take input_file from the first entry in our mov_list. For output_file, we can use the service_folder, but ffmpeg requires a full output path that includes a filename.

Since we’re transcoding preservation files to make service files, we can use the preservation filename as the basis for our service filename. And because we’re dealing with paths, let’s use an os.path function. The function os.path.basename will always return the last piece of a path, whether that’s a directory name or filename.

os.path.basename(mov_list[0])
napl1777.mov

That’s a good start. Our service files use an ‘mp4’ wrapper instead of an ‘mov’ wrapper. Because a filename is a string, we can use the replace() method for strings to make this change.

os.path.basename(mov_list[0]).replace('mov', 'mp4')
napl1777.mp4
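One caveat: replace() swaps every occurrence of 'mov' in the string, not just the extension. That is fine for a name like napl1777.mov, but it would mangle a filename that contains “mov” elsewhere. A possible alternative, sketched below with a hypothetical helper called service_name and made-up paths, is to split off the extension with os.path.splitext().

```python
import os

def service_name(path):
    """Return the filename from path with its extension swapped to .mp4."""
    stem, _extension = os.path.splitext(os.path.basename(path))
    return stem + '.mp4'

print(service_name('/tmp/pyforav/napl1777.mov'))
print(service_name('/tmp/pyforav/movie_night.mov'))

# For comparison, replace() also changes the "mov" at the start of the name.
mangled = os.path.basename('/tmp/pyforav/movie_night.mov').replace('mov', 'mp4')
print(mangled)
```

The rest of this lesson keeps using replace(), which works for the workshop files.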

For a final step, we can join the new filename to the service_folder path.

input_file = mov_list[0]
output_file = os.path.join(service_folder, os.path.basename(mov_list[0]).replace('mov', 'mp4'))
subprocess.run(['ffmpeg', '-i', input_file, '-c:v', 'libx264', '-pix_fmt', 'yuv420p', '-c:a', 'aac', output_file])
CompletedProcess(['ffmpeg', '-i', '/User/benjaminturkus/Desktop/pyforav/federal_grant/napl1777.mov', '-c:v', 'libx264', '-pix_fmt', 'yuv420p', '-c:a', 'aac', '/User/benjaminturkus/Desktop/pyforav/service/napl1777.mp4'], returncode=0)

Using Variables 1

One of the challenging things about learning to program is how and when to use variables. You can be verbose and store the result of every change to a variable like this.

original_filename = os.path.basename(mov_list[0])
output_filename = original_filename.replace('mov', 'mp4')
output_filepath = os.path.join(service_folder, output_filename)
subprocess.run(['ffmpeg', '-i', input_file, '-c:v', 'libx264', '-pix_fmt', 'yuv420p', '-c:a', 'aac', output_filepath])

You can also be very terse and nest all of your functions inside of each other like this.

subprocess.run(['ffmpeg', '-i', input_file, '-c:v', 'libx264', '-pix_fmt', 'yuv420p', '-c:a', 'aac', os.path.join(service_folder, os.path.basename(mov_list[0]).replace('mov', 'mp4'))])

Both of these approaches will work. In practice, most code falls somewhere between these extremes, producing something that is concise but readable. When you’re learning to code, it’s better to focus on getting code to run. Don’t be afraid of using too many variables as long as they help you understand the code you’re writing. And as you advance, you may find resources like the PEP 8 style guide useful.

Looking Before You Transcode

It might be tempting to wrap the ffmpeg command in a for loop and celebrate, but let’s think through what might happen when we run this code across all of the files in mov_list.

Problems with Transcoding Everything

What issues might you encounter if you tried to transcode every file in mov_list right now? What steps could you take to avoid those issues?

Solution

You would transcode files that you already have service files for, like napl1777.mov. To avoid this, you could use an if statement to skip the transcoding process for those files, just as we used one before calling os.makedirs() above.

Let’s focus on untranscoded movs. To do that, we’ll look before we leap by seeing if a service copy already exists for that mov file. We’ll explore more about how to use if statements in Lesson 8.

for item in mov_list:
	if item.endswith('mov'):
		output_file = os.path.join(service_folder, os.path.basename(item).replace('mov', 'mp4'))
		if not os.path.exists(output_file):
			subprocess.run(['ffmpeg', '-i', item, '-c:v', 'libx264', '-pix_fmt', 'yuv420p', '-c:a', 'aac', output_file])

os.listdir(service_folder)

Success!

Using Variables 2

The code above is an example of how a variable can be useful. By first storing the service file path in output_file, we are able to use it in two different contexts: first, to see if the service file already exists, and second, to tell ffmpeg where to write the new service file if it doesn’t. Without creating a variable for that file path, we would have needed to write os.path.join(service_folder, os.path.basename(item).replace('mov', 'mp4')) twice.
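In the spirit of testing before looping, one way to preview the loop above is a “dry run” that collects the input and output paths without calling ffmpeg. The function name plan_transcodes, the temporary folder, and the example filenames are all hypothetical; the logic inside the loop mirrors the code above.

```python
import os
import tempfile

def plan_transcodes(mov_paths, service_folder):
    """Return (input, output) pairs for movs that have no service copy yet."""
    plan = []
    for item in mov_paths:
        if item.endswith('mov'):
            output_file = os.path.join(service_folder, os.path.basename(item).replace('mov', 'mp4'))
            if not os.path.exists(output_file):
                plan.append((item, output_file))
    return plan

# A temporary service folder that already holds one (empty) service copy.
demo_service = tempfile.mkdtemp()
open(os.path.join(demo_service, 'already_done.mp4'), 'w').close()

demo_movs = ['/tmp/pyforav/already_done.mov', '/tmp/pyforav/still_needed.mov']
demo_plan = plan_transcodes(demo_movs, demo_service)
for input_file, output_file in demo_plan:
    print(input_file, '->', output_file)
```

Only the file without an existing service copy appears in the plan, which is the same set of files the real loop would send to ffmpeg.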

Key Points

  • Collecting filepaths is often the first step of any AV script