This lesson is being piloted (Beta version)

Data Analysis and Visualization (Bonus)

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How can I use Python/pandas to analyze data gathered from/about my media collection?

  • How can I use data visualization tools like matplotlib or seaborn to present my analysis in a clear and digestible way?

Objectives
  • Use pandas to analyze MediaInfo metadata

  • Use matplotib to visualize data analysis

What’s Next? Data Analysis and Visualization

Alice is happy with the fruits of her efforts–she’s gathered technical metadata from every file contained on her hard drives, she’s made copies of existing MP4s, she’s created new MP4s, she’s losslessly transcoded her uncompressed Quicktime master files, and she’s found ways to skip over those files (DV-encoded) that wouldn’t benefit from this kind of transformation.

But Alice has realized that while doing the work is one thing, being able to analyze the data surrounding her efforts, and being able to communicate that information upstream to senior management at New Amsterdam, is perhaps as important as creating smart, automated workflows to begin with.

pandas: the Google Sheets/Excel replacement you didn’t know you were missing

pandas is yet another python add-on, this time an extraordinarily powerful data analysis engine that comes from the world of finance, specifically quantitive analytics. While it may seem intimidating, pandas is remarkably user friendly and intuitive. pandas is excellent at:

  • normalizing data from diverse sources
  • performing calculations on rows or columns
  • concatenating or joining or grouping by category
  • visualization

And pandas is really excellent at: reading/writing to a range of metadata formats, including Excel, CSV, JSON, and more!

Alice’s new adventure: creating a scatterplot of compression ratios

After learning to work with FFmpeg, and testing the integrity and losslessness of FFV1, Alice is finally ready to present her findings to administrators and make a compelling case for transitioning the Library to lossless encodings for all audio, video, and film digitization. Alice can demonstrate checksum comparisons, and she can speak anecdotally about file size savings and ease-of-use, but she has a sense that overarching numbers and, even better, visualizations in the form of charts and graphs, might be more convincing to her supervisors who are too busy for the nitty-gritty.

Why a scatterplot? Why not? The combination of pandas and matplotlib can generate nearly any form of visualization–including pie charts, bar charts, heat maps, line graphs, histograms–but there is something unique to scatterplots that can highlight the variance inherent in FFV1 compression. Bottom line: FFV1 is smart and adaptive, and not all video content compresses in exactly the same way, to the exact same degree. With a scatterplot, we can visualize these differences, and possibly learn how to better predict compression ratios in the future.

Another set of FFV1/MKV, but this time with a mistaken twist

Let’s begin by transcoding another set of FFV1/MKV files from our Quicktime masters, but, to make things more interesting, let’s assume that Alice accidentally forgot to include an if statement that separated out her DV-encoded Quicktime files.

mov_list = [ ]

for root, dirs, files in os.walk(video_dir):
    for file in files:
        if file.endswith('.mov'):
            file_path = os.path.join(root, file)
            mov_list.append(file_path)

And let’s make a new landing pad for our new MKVs:

new_mkv_folder = os.path.join('Desktop/pyforav/new_mkvs')
if not os.path.exists(new_mkv_folder):
	os.makedirs(new_mkv_folder)

And let’s transcode our set, this time not paying attention to underlying codecs (DV might get bigger, not smaller!):

for item in mov_list:
        output_file = os.path.join(new_mkv_folder, os.path.basename(item).replace('mov', 'mkv'))
        subprocess.call(['ffmpeg', '-i', item, '-map', '0', '-dn', '-c:v', 'ffv1', '-level', '3', '-g', '1', '-slicecrc', '1', '-slices', '16', '-c:a', 'copy', output_file])

For reference, to identify our DV-encoded Quicktime files, we could easily use a series of nested for loops and if statements:


for item in mov_list:
    media_info = MediaInfo.parse(item)
    for track in media_info.tracks:
        if track.track_type == "Video":
            if track.format == "DV":
                print(item)

Checking our work

If we wanted to see if we had an equal count of MOVs and MKVs, how would we do so?

Solution

len(mov_list) returns 102

len(new_mkv_folder) returns 23

What gives?

The len of mov_list counts the list of full file paths that we created for the MOVs.

The len of new_mkv_folder returns the number of characters in the relative path ‘Desktop/pyforav/new_mkvs’

Annoying, to be sure, but to get the results we’re after, we’ll need to take another os.walk around the block.

mkv_list = [ ]

for root, dirs, files in os.walk(new_mkv_folder):
   for file in files:
      if file.endswith('.mkv'):
           file_path = os.path.join(root, file)
           mkv_list.append(file_path)

len(mkv_list)
102

The DataFrame–the core data structure of pandas

Before we make some cool looking charts, we’ve first gotta gather up the information that we’re hoping visualize. To do this, we’re going to pull some MediaInfo and feed it into a “dataframe,” essentially a spreadsheet written into python/pandas memory. The precise definition of a DataFrame (from geeksforgeeks.com) is “a two-dimensional, size-mutable, potentially heterogenous tabular data structure with labeled axes (rows and columns).” It might make you shudder, but again, just think of it as a nifty spreadsheet.

Step 1: gather up file names and file sizes from our original set of Quicktime files.

all_file_data = []

for item in mov_list:
    media_info = MediaInfo.parse(item)
    for track in media_info.tracks:
        if track.track_type == "General":
            general_data = [
                track.file_name,
                track.file_size]
    all_file_data.append(general_data)

Step 2: feed that information into a dataframe, naming the columns.

import pandas as pd
df = pd.DataFrame(all_file_data, columns = ['filename' , 'origsize']) 
df
	filename	origsize
0	napl0154	10373547
1	napl0202	12261877
2	napl0259	8489273
3	napl0408	6603097
4	napl0497	11313821
...	...	...
97	stantonrequest_0162	7544373
98	stantonrequest_0196	6597103
99	stantonrequest_0284	9426295
100	stantonrequest_1407	7543563
101	stantonrequest_1882	10366815
102 rows × 2 columns

Step 3: gather the file sizes of the MKVs, and add them to the dataframe by simply declaring the list as a new column.

mkv_sizes = []
for item in mkv_list:
    media_info = MediaInfo.parse(item)
    for track in media_info.tracks:
        if track.track_type == "General":
            mkv_sizes.append(track.file_size)


df['new_size'] = mkv_sizes

df
filename	      origsize	    new_size
0	napl0154	10373547	4941973
1	napl0202	12261877	385272
2	napl0259	8489273   	325594
3	napl0408	6603097   	2803794
4	napl0497	11313821	179603
...	...	...	..
102 rows × 3 columns

By default, pandas only displays a select number of rows; to get a full and complete output, you can make an adjustment:

pd.set_option('display.max_rows', 1000)

matplotlib and data visualization

Let’s take our dataframe and transform it into something both visually appealing and communicative. To make this happen, we’ll call upon matplotlib, a 2D plotting library which offers a host of amazing features and customizability.

First we’ll import matplotlib and ask Jupyter to display visualizations directly in our notebook.

import matplotlib.pyplot as plt
%matplotlib inline

Next, we’ll create our first scatterplot, using the original size as our starting point and the ratio between the new and original sizes as our point of comparison.

plt.scatter(df['origsize'], (df['new_size'] / df['origsize']))

First-Lossless-Scatter

From here, we can create an image with a bit more pizzazz by adding a title, labels to the X- and Y-axes, color-coding our scatter dots, and incorporating a special Seaborn “darkgrid” style.

plt.style.use('seaborn-darkgrid')
plt.title("Lossless Compression Results", fontsize=12, fontweight='bold')
plt.ylabel('Compression Ratio')
plt.xlabel('Original File Size')
plt.scatter(df['origsize'], (df['new_size'] / df['origsize']), edgecolors='r', c='blue')

And finally, if we want to save our image for future use, we can save our figure to the location of our choosing:

plt.style.use('seaborn-darkgrid')
plt.title("Lossless Compression Results", fontsize=12, fontweight='bold')
plt.ylabel('Compression Ratio')
plt.xlabel('Original File Size')
plt.scatter(df['origsize'], (df['new_size'] / df['origsize']), edgecolors='r', c='blue')
plt.savefig('/Users/username/Desktop/scatter.png', dpi=300, bbox_inches='tight')

Final-Lossless-Scatter

If the scatterplot is feeling a little “what is happening here?” to you, know that pandas and matplotlib can be used in more straightforward applications.

Here, for example, is a breakdown of objects digitized by format:

Piechart-FormatBreakdown

And here’s a line graph of objects digitized per month:

Linegraph-ObjectsPerMonth

Key Points

  • Automate your CSV/Excel

  • Analyze data as you go