
FLAC in the archives

The first time I heard about FLAC was from a co-worker in the early days of my first full-time audiovisual archivist gig. I was trying to start digitization projects and figure out preservation practices. He was working in a half-IT, half-broadcast-engineer capacity and was happy to support archival work where he could. We were discussing audio preservation and digitization of 1/4″ audio reels and he remarked that FLAC was really an ideal choice for this type of work. I hadn’t heard much about FLAC before, but based on the listservs of ARSC and AMIA I knew that when an archivist is asked to select a digital audio format, Broadcast Wave Format (BWF) was really the only legitimate choice. We went on to debate preservation objectives and the advantages and disadvantages of one format versus the other. Broadcast Wave was the “best practice” in digital audio archiving, but by the end of the conversation I was questioning why I was defending it.

My colleague clarified that the choice between FLAC and BWF was not about audio quality, since FLAC is a lossless audio encoding. A FLAC encoding and a BWF encoding of the same audio signal (at the same specifications) will decode back to the same audio signal, but the FLAC file is much smaller (about a third the size of the uncompressed audio). He also pointed out that FLAC is an open format well supported by free software. During this conversation I was imagining the shock and disbelief that might emanate from various archival communities on learning that a n00b archivist was being lured towards the lossless audio codecs of Free Software. For BWF I didn’t have much of a defense; it was a well-respected standard across the audio archiving community, but at that point I didn’t know why. I feebly tried a BWF defense by pointing out that because the BWF file is larger than the FLAC, it may be more resilient, since a little bit of corruption would have a more damaging effect on the compact FLAC than on the vast BWF file.

Following this conversation I searched archival listservs for references to FLAC and didn’t find much, though I did find references to FLAC in archival environments at http://wiki.etree.org and on band sites. This research also led me to the communities that develop FLAC and related applications. Around that time their work was especially productive, as noted in their change log. All this left me confused, as if FLAC and BWF played the same singular role in two parallel archival community universes.

For the time being, I would digitize analog audio to BWF and sleep well. There was a large amount of audio cassette transfers, CD ripping, and reel-to-reel work, and we worked to keep the decks running day after day to achieve our preservation goals. As the data piled up, digital storage became an increasingly complicated issue. Audio data was simply being created faster than digital storage was expanding. As storage stresses grew, FLAC looked more and more tempting. Finally, in 2007, FLAC 1.2.1 added an option called --keep-foreign-metadata, which meant that not only could I make a FLAC file from a BWF that losslessly compressed the audio, but I could also keep all of the non-audio data of the BWF as well (descriptive information, embedded dates, bext chunks, cart chunks, etc). Basically this update meant that one could compress a BWF to a FLAC file and then uncompress that FLAC back to the original BWF file, bit for bit. Knowing that I could completely undo the FLAC decision at any time with these new options, I finally went FLAC. Using the FLAC utilities and tools such as X Lossless Decoder I compressed all the BWF files to FLAC, recovering substantial amounts of digital storage. This process involved a lot of initial testing and workflow tinkering to make sure that the FLAC compression was a fully reversible process. It was, and I was happy to finally make the preservation-standard switch and invest in learning FLAC inside and out.

[ technical interlude ]

If you wish to convert WAVE files to FLAC files in a preservation context, here is how I recommend you do it. First, use the official FLAC utility, or a GUI that gives you access to the options mentioned below. The following is a list of FLAC utility options that I found relevant:

--best: We can wait for the most beneficial result. This option prioritizes file size reduction over encoding speed.

--keep-foreign-metadata: For WAVE files or AIFF files this option will cause the resulting FLAC to store all non-audio chunks of data that may be in the source file. Ideally this option should be used during all FLAC encoding and decoding to ensure metadata survives all procedures.

--preserve-modtime: Optional, but I found this handy. This option applies some of the timestamps of the source file to the output, whether going from WAV->FLAC or FLAC->WAV.

--verify: Verify! Digital preservation is always an environment of paranoia. This option will cause the utility to do extra work to make sure that the resulting file is valid.

--delete-input-file: If everything else is successful this will delete the source file when the FLAC is completed.

In addition to these options I recommend logging the stdout, stderr, and original command along with the resulting output file.

Putting this all together the command would be: flac --best --keep-foreign-metadata --preserve-modtime --verify --delete-input-file audiohere.wav

When running this command the file audiohere.wav will soon disappear and be replaced by a much smaller file called audiohere.flac. To reverse the process add the --decode option: flac --decode --keep-foreign-metadata --preserve-modtime --verify --delete-input-file audiohere.flac and then you get the WAV file back.
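The logging recommended above, plus a bit-for-bit check of the round trip, could be scripted along these lines. This is only a minimal sketch in Python: the helper names (sha256_of, run_and_log, flac_encode_command, flac_decode_command) and the log layout are my own assumptions, and the only flac options used are the ones already listed in this interlude.

```python
import hashlib
import shlex
import subprocess
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_and_log(cmd, log_path):
    """Run cmd (a list of arguments) and write the command line,
    exit status, stdout, and stderr to log_path (hypothetical layout)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    Path(log_path).write_text(
        f"command: {shlex.join(cmd)}\n"
        f"exit status: {result.returncode}\n"
        f"--- stdout ---\n{result.stdout}"
        f"--- stderr ---\n{result.stderr}"
    )
    return result.returncode


# The flags shared by the encode and decode commands given in the text.
COMMON_FLAGS = ["--keep-foreign-metadata", "--preserve-modtime",
                "--verify", "--delete-input-file"]


def flac_encode_command(wav_path):
    """Build the WAV -> FLAC command from the text."""
    return ["flac", "--best", *COMMON_FLAGS, str(wav_path)]


def flac_decode_command(flac_path):
    """Build the FLAC -> WAV command from the text."""
    return ["flac", "--decode", *COMMON_FLAGS, str(flac_path)]
```

To test reversibility on a copy of a file, I would record sha256_of("audiohere.wav") first (the encode deletes the source), then call run_and_log(flac_encode_command("audiohere.wav"), "encode.log"), then run_and_log(flac_decode_command("audiohere.flac"), "decode.log"), and finally confirm that sha256_of("audiohere.wav") matches the value recorded at the start.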

[/ technical interlude ]

The file size advantages led to benefits in other types of processing. FLAC files could be uploaded to the Internet Archive in a third of the time a WAV file would take, and we could move more audio data from DATs or CDs to LTO storage.

A few years later I realized another bonus of FLAC as an audio preservation file format, one that fits well within digital preservation: its strong fixity integration. Each FLAC file contains an MD5 checksum of the unencoded audio data in its header (the STREAMINFO metadata block). With this feature a specific audio recording could be encoded to many different FLAC files which may differ (one FLAC may be encoded for speed, another for size, another containing extra metadata) but each FLAC file would contain the same checksum, which represents the source audio data. This is often called the FLAC fingerprint. etree.org has some great resources on the FLAC fingerprint at http://wiki.etree.org/?page=FlacFingerprint. The fingerprint gives all FLAC files a built-in checksum, and thus any FLAC file can be tested as to the integrity of its encoded data. If a FLAC file is truncated through partial download, corrupted, or manipulated in a way that would affect the audio data, then the FLAC file can be identified as invalid or problematic without needing an external checksum file.
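To make the fingerprint a little more concrete, here is a minimal Python sketch that reads it straight out of the file. It assumes the layout given in the FLAC format specification: a 4-byte "fLaC" marker, a 4-byte metadata block header, then the 34-byte STREAMINFO block whose last 16 bytes are the MD5 signature. The function name read_flac_md5 is my own, and files with something prepended before the "fLaC" marker (an ID3v2 tag, for instance) are simply rejected. The metaflac utility reports the same value with its --show-md5sum option.

```python
def read_flac_md5(path):
    """Return the MD5 signature stored in the STREAMINFO block, as hex.

    An all-zero result means the encoder did not store a signature.
    """
    with open(path, "rb") as f:
        header = f.read(42)  # marker (4) + block header (4) + STREAMINFO (34)
    if len(header) < 42 or header[:4] != b"fLaC":
        raise ValueError("not a native FLAC stream")
    block_type = header[4] & 0x7F            # low 7 bits; 0 means STREAMINFO
    block_length = int.from_bytes(header[5:8], "big")
    if block_type != 0 or block_length != 34:
        raise ValueError("first metadata block is not STREAMINFO")
    return header[26:42].hex()               # last 16 bytes of STREAMINFO
```

Two FLAC files made from the same audio with different settings should return the same string here even though their whole-file checksums differ.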

Deeper within the FLAC file, audio samples are grouped into audio frames, each of which carries its own CRC checksum. If a FLAC file suffers from bit rot or other corruption, then a FLAC decoder such as FFmpeg’s can report precisely where the problem is. This reporting lets an archivist resolve the problem more efficiently.
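As a rough illustration of how a per-frame CRC catches even a one-bit change, here is a small Python sketch. It assumes the CRC-16 parameters given in the FLAC format specification (polynomial x^16 + x^15 + x^2 + 1, i.e. 0x8005, initialized with 0). The function names and the "frame" bytes are made up for the example; this does not parse real FLAC frames.

```python
def crc16_flac(data):
    """CRC-16 with polynomial 0x8005, initial value 0, most significant bit first."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8005) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def flip_bit(data, byte_index, bit_index):
    """Return a copy of data with a single bit inverted."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit_index
    return bytes(damaged)


# Stand-in for the bytes of one frame; not a real FLAC frame.
frame = bytes(range(64))
stored_crc = crc16_flac(frame)        # what an encoder would store with the frame

damaged = flip_bit(frame, 10, 0)      # the smallest possible corruption
print(crc16_flac(frame) == stored_crc)    # True: intact frame validates
print(crc16_flac(damaged) == stored_crc)  # False: the flipped bit is caught
```

Because every frame has its own CRC, a failed check narrows the damage down to that one frame rather than to the file as a whole.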

To show how this works I’ll make a small 5 second FLAC file of a sine wave with ffmpeg like this: ffmpeg -f lavfi -i sine -t 5 sinewav.flac. Then in a hex editor I’ll change just one bit, the smallest possible corruption. To test the file I can use the test feature in the flac utility like: flac --test sinewav.flac which gives:
sinewav.flac: ERROR while decoding data

but this error isn’t very clear. The test shows that a CRC checksum stored within the FLAC file failed validation, so there was some change after encoding, but the report doesn’t show where. FFmpeg does this a little better. If I decode the FLAC file with FFmpeg like: ffmpeg -loglevel error -i sinewav.flac -f null - then I get more specific news.

FFmpeg reporting a crcerror from a corrupted FLAC file.


PTS stands for presentation timestamp. The value 82,944 here refers to the sample where the problem starts. Since the sample rate of sinewav.flac is 44,100 Hz, I can divide 82,944 by 44,100 to get about 1.88 seconds, which is where I can find the problem. Here is the corresponding area as shown by a waveform image in Audacity.
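The same arithmetic as a tiny Python sketch, in case there are many reports to convert. It assumes, as above, that the PTS value counts samples; the function names are my own.

```python
def pts_to_seconds(pts, sample_rate):
    """Convert a PTS counted in samples to seconds."""
    return pts / sample_rate


def format_timestamp(seconds):
    """Format seconds as H:MM:SS.mmm for finding the spot in an audio editor."""
    total_ms = round(seconds * 1000)
    total_seconds, ms = divmod(total_ms, 1000)
    minutes, secs = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


seconds = pts_to_seconds(82944, 44100)
print(round(seconds, 2))          # 1.88
print(format_timestamp(seconds))  # 0:00:01.881
```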

Audacity showing a corrupted flac file.


Because a FLAC file contains an MD5 checksum of all the audio data and CRC checksums for each frame of encoded audio, it is possible to discover with fairly accurate precision which areas are affected by corruption. A WAV file doesn’t have such a feature: it would require an external checksum to allow for any integrity testing, and would not provide a way to pinpoint corruption to any particular area.

Moving into different archival projects I’m certainly quicker to consider FLAC a significant option in digital audio preservation. “Best practices” in archiving might not necessarily be the best use of current technology. Best practices require ongoing re-evaluation and improvements and I’d rather refer to them as “good-enough-for-now practices”. At least for me, FLAC is good enough for now.

Reconsidering Checksums published in IASA Journal

Last month the IASA Journal published an article I wrote on error detection and fixity issues. While IASA agreed to publish the article under an open license, in this case CC-BY-ND, the journal does not (yet) have an open access policy.

The article discusses two different approaches used in the application of checksums for audiovisual data: embedded checksum data used to audit transmission (MPEG CRCs, FLAC Fingerprints, and DV parity data) and external whole-file checksums (more typical of digital preservation environments). In the article I outline how the effectiveness of a whole-file checksum does not scale well for audiovisual data and make proposals on how formats such as FFmpeg’s framemd5 can enable more granular and efficient checksums for audiovisual data.
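The difference between the two approaches can be sketched in a few lines of Python. This is a simplified analogue and not framemd5 itself: framemd5 works on decoded frames, while this example just hashes fixed-size byte chunks of made-up data. The function names and the chunk size of 100 bytes are assumptions for illustration.

```python
import hashlib


def chunk_md5s(data, chunk_size):
    """Return one MD5 hex digest per fixed-size chunk of data."""
    return [hashlib.md5(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def changed_chunks(before, after):
    """Return the indexes of chunks whose digests differ."""
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]


original = bytes(1000)            # stand-in for an audiovisual file
damaged = bytearray(original)
damaged[530] ^= 0x01              # flip a single bit in byte 530
damaged = bytes(damaged)

# A whole-file checksum only says that something, somewhere, changed.
print(hashlib.md5(original).hexdigest() == hashlib.md5(damaged).hexdigest())  # False

# Per-chunk checksums say where: byte 530 falls in chunk 5 (bytes 500-599).
print(changed_chunks(chunk_md5s(original, 100), chunk_md5s(damaged, 100)))    # [5]
```

With a whole-file checksum the entire file is suspect after a mismatch; with granular checksums only the affected segment needs attention.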

terminal output of ffmpeg evaluating framemd5 for an input file

The article may be found in IASA Journal Number 39 (login required) or re-posted on this blog here.