Archive media for the long term with preservation masters

October 2, 2019

Buzz Hays

Global Lead, Entertainment Industry Solutions, Google Cloud

Media and entertainment companies have plenty of storage needs, and have to make sure they’re storing the right data for the right period of time. Media archives are often thought of as repositories of readily available media to be accessed for various workflow needs in the near and immediate term, such as in editorial and post-production as well as for storing distribution masters.

But long-term digital media preservation may not always be the highest priority for busy production companies and media archives. Arguably the most important role of a media archive is to provide preservation masters of the content that can be retained and accessed at any point in the future. When you’re working with media files in Google Cloud Platform (GCP), preservation is an important consideration.

Digital media archives usually include a variety of file types, including moving image files, still images, binary files, and documents. Within the category of moving image files, there are hundreds of wrapper types, file types, and codecs. Although these media formats are readily available today and are easy to integrate into workflows, they are not necessarily designed for long-term preservation. Over the years, media types can change frequently and become obsolete (think of eight-track tapes, CDs or DLT backup tapes).

Codecs have a shelf life, and moving image compression is constantly being refined. More popular formats are regularly updated and improved, but others get discontinued, making it very difficult or impossible to read the media files in the future. Additionally, some codecs require licenses that could create problems years down the line when the codec developer is no longer in business. For example, broadcast masters are often recorded with a visually lossless codec that works well through the workflow ecosystem, but that codec can become obsolete. If your archived media files are stored in a compressed form in an end-of-life codec, you may be able to find a way to read them. But, at worst, you’ll be paying to store a considerable amount of useless data on cloud storage and you will have lost the underlying media. So compressed video files and proxies are not appropriate for long-term storage.

It’s important to consider ways to create preservation masters of your media that can withstand the test of time. We’ll explain here how to create preservation masters using GCP, specifically Cloud Storage, so that your archived media files will be accessible well into the future.

Note that media asset management systems rely on proxy files for ease of search, for defining clips, and for initiating transcodes, among other tasks such as machine learning (ML) and artificial intelligence (AI) analysis. Creating these proxies in a common format can make archive maintenance easier. Proxies are designed to represent a compressed version of the source media for efficient storage, retrieval and review, often at a lower resolution and quality than the originals. You should treat them as working media files, separate and distinct from archive or preservation master files.

Recommended practice for creating media preservation masters

Within a media archive, preservation masters should be stored in a format that can be retrieved and read easily at any point in the future. The recommended practice is to create the preservation masters as frame sequences, converted from the same original source movie files from which all proxies were derived. The Academy of Motion Picture Arts & Sciences (AMPAS), the National Archives, and the Library of Congress all agree on this frame sequence approach. You can create frame sequences from movie files with a variety of tools, such as FFMPEG, OpenDCP, or any number of transcoder solutions. (We’ll describe an example using FFMPEG later in this post.)

Once you have these resulting frame sequences, store them in a format that mirrors the quality and resolution of the source material as closely as possible. You can then move these files to Coldline storage, the longest-term storage class in Cloud Storage, for preservation. Keeping this copy in the cloud also helps satisfy a motion picture industry requirement: a minimum of three copies of all media, stored in geographically disparate locations. The cloud copy provides for disaster recovery of media files, should physical data tape copies become damaged or lost. Coldline storage is ideal for what is often referred to as the “third copy,” the copy of last resort should all other copies fail. The preservation master rarely, if ever, needs to be accessed in this scenario, since the mezzanine and proxy files of the media can be stored in Standard, Nearline and Coldline storage for any near-term use of the media. Over time, as new, higher-quality codecs become available, you can use the preservation masters to create new sets of proxies and mezzanine files with the new codecs and formats as needed.

There are various formats available for storing image sequences that are appropriate for preservation masters. DPX (originally developed by Kodak) is the most common format, while OpenEXR and JPEG 2000 are becoming more popular. Although some of these formats use compression, they are considered valid, high-quality archive formats by archivists around the world.

Most archives have specifications on the preferred formats for particular applications. There is no one-size-fits-all frame format for preservation, as the right choice depends on the source material and its specifications. For example, consider an old black-and-white newsreel transferred from film to digital video with an aspect ratio of 1.33 to 1 at standard definition resolution. There’s no reason to archive this media with 16-bit color at HD resolution, as that information doesn’t exist in the source material. Archiving at a higher color depth and resolution only makes for larger file sizes with no improvement in the quality of the media itself.
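To make the size penalty concrete, here is a rough back-of-the-envelope sketch of uncompressed frame sizes. The pixel dimensions (640×480 for a square-pixel 1.33:1 standard definition frame, 1920×1080 for HD), the channel counts, and the function name are illustrative assumptions rather than values from any archive specification. Real formats such as DPX also add headers and padding on top of the raw pixel data.

```python
def raw_frame_bytes(width: int, height: int, channels: int, bits_per_channel: int) -> int:
    """Return the size of one uncompressed frame in bytes (pixel data only)."""
    total_bits = width * height * channels * bits_per_channel
    return total_bits // 8


# Assumed case 1: single-channel (grayscale) 8-bit standard definition frame.
sd_mono_8bit = raw_frame_bytes(640, 480, channels=1, bits_per_channel=8)

# Assumed case 2: the same picture stored as three-channel 16-bit HD.
hd_rgb_16bit = raw_frame_bytes(1920, 1080, channels=3, bits_per_channel=16)

print(sd_mono_8bit)                 # 307200
print(hd_rgb_16bit)                 # 12441600
print(hd_rgb_16bit / sd_mono_8bit)  # 40.5
```

Under these assumptions, every frame would take roughly 40 times more storage without capturing any additional picture information.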

Creating media preservation masters at ingest

As part of the input pipeline of any cloud-based workflow, consider creating a preservation master file sequence at the same time the content is ingested into the system. With this parallel process, any proxies or mezzanine copies required by the workflows can be created at the same time as the master file, so you don’t have to move large amounts of data in and out of various storage classes. The preservation master can be moved to Coldline storage once all of the relevant file naming, metadata, fixity/digest entry, and formatting stages are complete.

Here’s an example workflow:

1. Create checksums of the source media files on the local machine
   a. Log file names into the media asset management (MAM) system
   b. Log source checksums into the MAM

2. Copy the media files into Cloud Storage
   a. Compare checksums of the source files to the GCP copies

3. Transcode the source files into proxies
   a. Mezzanine format: log mezzanine file names/locations into the MAM
   b. Proxies for ML/AI/search/MAM applications: log proxy file names/locations into the MAM
   c. Apply ML/AI APIs for metadata extraction
   d. Log metadata into the MAM

4. Transform source files into image sequences
   a. Image: use FFMPEG to produce TIFF, DPX, OpenEXR or other archival formats
   b. Audio: use FFMPEG to produce uncompressed WAV or another archival audio format

5. Move the image sequences and audio files into Coldline storage
   a. Log the file location paths into the MAM
   b. Log checksums into the MAM (derived from file headers)
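The checksum steps in this workflow could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names are hypothetical, and the choice of base64-encoded MD5 assumes you want a value that can be compared directly with the MD5 hash Cloud Storage reports in object metadata (check the current Cloud Storage documentation for which hashes are available for your objects). How you fetch the remote values and log them into your MAM is left out.

```python
import base64
import hashlib
from pathlib import Path
from typing import Dict, List


def file_md5_base64(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the base64-encoded MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large media files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def find_mismatches(local: Dict[str, str], remote: Dict[str, str]) -> List[str]:
    """Return file names whose checksum is missing remotely or differs."""
    return sorted(
        name for name, checksum in local.items() if remote.get(name) != checksum
    )
```

In use, you would build the `local` mapping from the source files before upload, build the `remote` mapping from the stored objects afterward, and treat any name returned by `find_mismatches` as a copy that needs to be transferred again.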

Create an image sequence using FFMPEG

A number of tools for image manipulation can create archive-quality image sequences from movie files. FFMPEG is an open source tool used for a wide variety of media processing needs. Here’s a tutorial using FFMPEG to create an image sequence from a movie file (note that the particulars of your process may vary based on your company’s policies and other details).

1. Download and install FFMPEG for your operating system on your local machine. You will use your local terminal or shell for these exercises.

While FFMPEG is very extensive in its capabilities, for the purposes of this tutorial, you’ll only need to focus on a few simple commands. Check out the documentation for FFMPEG if you want to explore the tool further.

Please note that your file storage footprint may increase when extracting image sequences. For example, when extracting the test file below to a DPX sequence, the aggregate data footprint is 5.36 GB, while the equivalent JPEG 2000 (j2k) sequence totals 60 MB. Your own archive policies should dictate which extraction format is best for your preservation requirements. The bit depth and resolution of the source files will help in determining the best frame sequence format for your needs.

2. Download this ProRes video test file to use in the conversion. You’ll copy it to a new directory in a moment.

3. Within your terminal/shell window, go to your home directory and create a new directory named myTestSequence to store your image sequence.

4. Locate the downloaded TestProRes4444.mov file and move it to the myTestSequence directory you created in step 3.

5. Run the following FFMPEG command in your terminal/shell from the myTestSequence directory:

This command will read the TestProRes4444.mov file and convert it to a j2k sequence at the highest quality (specified by the -q:v 1 parameter). The ‘_%06d’ parameter just before the output file extension pads the image sequence numbers to six digits with leading zeros. You’ll want to adjust this to accommodate the number of frames you’ll be extracting (for example, an hour of video recorded at 30 frames per second contains 108,000 frames). Refer to the FFMPEG documentation for the full set of parameters for the JPEG 2000 format.
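As a hedged illustration of how these pieces fit together, the sketch below derives the padding width from an expected frame count and assembles an FFMPEG argument list for a JPEG 2000 sequence. It is not the exact command from this tutorial: the `frame` output prefix, the function names, and the explicit `-c:v jpeg2000` encoder selection are assumptions made for the example, while `-q:v 1` and the zero-padded `%0Nd` pattern follow the description above. The code only builds the command and does not run it.

```python
from typing import List


def padding_width(frame_count: int) -> int:
    """Return the number of digits needed for the highest frame number."""
    if frame_count < 1:
        raise ValueError("frame_count must be at least 1")
    return len(str(frame_count))


def build_j2k_sequence_command(
    source: str, frame_count: int, prefix: str = "frame"
) -> List[str]:
    """Build (but do not run) an ffmpeg argument list for a j2k sequence."""
    # e.g. frame_%06d.j2k when six digits are needed
    pattern = f"{prefix}_%0{padding_width(frame_count)}d.j2k"
    return ["ffmpeg", "-i", source, "-c:v", "jpeg2000", "-q:v", "1", pattern]


# One hour at 30 frames per second is 108,000 frames, which needs six digits.
print(build_j2k_sequence_command("TestProRes4444.mov", 108_000))
# ['ffmpeg', '-i', 'TestProRes4444.mov', '-c:v', 'jpeg2000', '-q:v', '1', 'frame_%06d.j2k']
```

Passing the resulting list to `subprocess.run` from the myTestSequence directory would execute it, provided FFMPEG is installed and on your path.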

6. List the files in your myTestSequence directory. You should see a list of files like these:

7. Although this test file contains no audio, it’s possible to extract embedded audio files from your source movie using the following command:

The -ab 192000 parameter determines the data rate for the extracted audio file. For all image sequence and audio settings, follow your own internal best practices when choosing the preservation image and audio formats, along with the parameters for each that best satisfy your archive media strategy. For further information on recommended formats, refer to the reading list below.
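If your archive policy calls for uncompressed WAV, as in the example workflow earlier, an extraction command might look like the sketch below. This is an assumption-laden example rather than the command used in this tutorial: the function name, the `audio.wav` output name, and the 24-bit `pcm_s24le` codec are illustrative choices. A bitrate setting such as -ab generally matters only for compressed audio codecs, so it is left out here. The code only builds the argument list and does not run it.

```python
from typing import List


def build_audio_extract_command(source: str, output: str = "audio.wav") -> List[str]:
    """Build (but do not run) an ffmpeg argument list for PCM WAV extraction."""
    return [
        "ffmpeg",
        "-i", source,
        "-vn",                # ignore the video stream
        "-c:a", "pcm_s24le",  # 24-bit little-endian PCM (assumed choice)
        output,
    ]


print(build_audio_extract_command("source_with_audio.mov"))
```

Match the bit depth and sample rate to the source audio where you can, for the same reason the frame format should mirror the source picture.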

Converting compressed movie files into archival frame sequences and audio files makes the media much better suited to long-term preservation.

Learn more about archiving with Cloud Storage, and more about the preservation of digital media: