#5 External Recorders

External recorders come in at #5 in my list of Tops of 2015.

A few years ago, the video camera was the most important device used in production. It included several functions: optics, sensor, encoding and storage. But the trend is to do encoding and storage outside of the camera, in an external video recorder.

Solid-state drives (SSDs) and high-speed connectivity through HD-SDI and Thunderbolt enable live recording of RAW and uncompressed feeds from 2K and 4K cameras. ProRes and DNxHD formats, previously only used in post-production, are now production formats.

Camera <----------------> AJA Ki Pro Quad

Where we had a closed device, now we have an open system that is more versatile, but also more fragile.

Lessons of Redundancy

A few weeks ago, Mark contacted me to recover a 50 GB file containing footage from a concert.
The event was recorded in DNxHD format by an Atomos Ninja 2 recorder.

Mark is a seasoned professional who always runs two recorders for redundancy: prime and backup.
The prime recorder stopped and no one noticed until the end of the program.
This does not happen often, but it can happen, which is why a backup recorder must always be running when recording a live event.

Unfortunately, the backup recorder had its own share of problems: it crashed as well!

Statistics say that if a system has a failure rate of 1%, then running two systems for redundancy makes the chance of both failing at the same time extremely low:

1% of 1%, or 1 out of 10000

That’s why redundancy matters if you are to keep your reputation as a professional:
1 out of 100 will happen almost every year, but 1 out of 10000 is more like hitting a hole in one in golf: it may happen once in your career, or it may never happen at all.
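
The arithmetic above can be checked with a short Python sketch. It assumes the two recorders fail independently, and the figure of 100 recordings per year is an assumption chosen for illustration, not a number from Mark's case.

```python
def both_fail(p_a: float, p_b: float) -> float:
    """Probability that two independent recorders fail on the same job."""
    return p_a * p_b


def at_least_one_failure(p: float, jobs: int) -> float:
    """Probability of losing at least one recording over a number of jobs."""
    return 1.0 - (1.0 - p) ** jobs


JOBS_PER_YEAR = 100  # assumption for illustration only

single = 0.01                          # 1% failure rate per recording
redundant = both_fail(single, single)  # 1% of 1%

print(f"one recorder:  {at_least_one_failure(single, JOBS_PER_YEAR):.1%} chance of a lost job per year")
print(f"two recorders: {at_least_one_failure(redundant, JOBS_PER_YEAR):.1%} chance of a lost job per year")
```

Under these assumptions, a single recorder loses at least one job in roughly 63% of years, while the redundant pair does so in about 1% of years.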

But that’s the theory.

In the real world, I’m seeing far more failures in redundant recordings than the statistics predict.
Let me explain the three pitfalls of redundant recording and how to avoid them.

Redundancy Pitfall #1: Prime vs Backup

Mark had two recorders set up, but did they have the same chance of success?

Probably not. My guess is that the prime recording set-up was first-rate, whereas the backup set-up was a disaster waiting to happen.
For the prime recorder, Mark had used his gear in the best condition: a very reliable recorder with encoding power to spare and fast storage.
For the backup set-up, Mark had used the old recorder, a bit underpowered for the job, and a memory card that had never given problems in the past but was 80% full (of footage from a previous job…)

Does this sound familiar? Mark had a redundant set-up, but the backup recorder had a failure rate that was probably around 20% instead of the theoretical 1%.
Unfortunately, this was one of those bad days for Mark. With the backup recorder struggling to keep up with encoding, frames started dropping, and as the card filled up, fragmentation caused the write speed to plummet, which finally caused the recorder to crash.

A disaster like this is therefore not due to a single failure, but to an accumulation of problems.
Redundancy is only safe if both recorders are working in optimal conditions. Or to put it another way: your reputation as a video professional is in the hands of the less reliable of your two recorders, usually the backup system.
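
To see how much a neglected backup costs, here is a minimal Python sketch that compares a matched pair with Mark's probable set-up. The 1% and 20% figures are the guesses given above, and the calculation still assumes the two recorders fail independently.

```python
def joint_failure(p_prime: float, p_backup: float) -> float:
    """Probability that both recorders fail on the same job (independent failures)."""
    return p_prime * p_backup


# Failure rates per job: (prime, backup). The 20% value is only a guess.
cases = {
    "matched pair (1% and 1%)": (0.01, 0.01),
    "neglected backup (1% and 20%)": (0.01, 0.20),
}

for label, (p_prime, p_backup) in cases.items():
    p = joint_failure(p_prime, p_backup)
    print(f"{label}: both fail on about 1 job in {round(1 / p):,}")
```

With a 20% backup, the pair loses about 1 job in 500 instead of 1 in 10000, which is twenty times worse than the theory promises.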

Merely labelling the recorders “prime” and “backup” is the first breach of redundancy. The two are equally important and should rather be labelled “prime even” and “prime odd”: you would use the footage from the even recorder on even days and from the odd recorder on odd days.
That would protect you against the “broken backup” syndrome; more on that in pitfall #3.
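
The even/odd rule is simple enough to write down as a tiny Python sketch. The function name `prime_for` is hypothetical, and the rule uses the day of the month.

```python
from datetime import date


def prime_for(day: date) -> str:
    """Return which recorder's footage to treat as prime on the given day."""
    return "Even" if day.day % 2 == 0 else "Odd"


# The 1st of the month is an odd day, so the "Odd" recorder is prime.
print(prime_for(date(2016, 1, 1)))
# With no argument handling needed, today's choice is simply:
print(prime_for(date.today()))
```

The point of the rule is that both recorders get used, and therefore checked, about half of the time.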

Redundancy Pitfall #2: Independence

In statistics, one of the most important concepts is independence:
Two events are independent when the occurrence of one does not change the probability of the other.

The theoretical “1 out of 10000” rate is based on the assumption of independent failures.
It means that no common cause can ever produce a failure in both systems.

In the real world, this assumption is often false!
For example, if the recorders are connected to the same power source, or if they are writing to the same storage device, it’s clear that we don’t have redundancy, only the illusion of it.

So is the solution to have two identical systems (to avoid pitfall #1) that are fully independent? Nope.

Identical systems are not independent, because they share the same latent flaws. That’s more subtle, but it can be just as devastating.
Imagine two identical recorders: same model, same firmware, same recording settings… If this model of recorder starts to fail when the temperature is over 35 degrees, both recorders will be at risk because they are exposed to the same temperature and workload. The second recorder will probably fail a few minutes after the first one.
The same goes for firmware bugs: if one recorder crashes under a certain condition, the other one will probably crash too.
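
One way to put numbers on shared causes is the beta-factor model from reliability engineering, which is not something used in Mark's case but a common simplification. In this Python sketch, `beta` is the fraction of a recorder's failures that come from a cause shared with the other recorder (power, heat, firmware). The value of 10% is an assumption for illustration.

```python
def both_fail_with_common_cause(p: float, beta: float) -> float:
    """Probability that both recorders fail when some failures share a cause.

    p    -- failure probability of one recorder per job
    beta -- fraction of those failures caused by something both recorders share
    """
    common = beta * p             # a shared cause takes out both recorders at once
    independent = (1 - beta) * p  # a failure specific to a single recorder
    # Either the shared cause strikes, or it does not and both fail on their own.
    return common + (1 - common) * independent ** 2


for beta in (0.0, 0.1):  # 0.1 is an illustrative assumption
    p = both_fail_with_common_cause(0.01, beta)
    print(f"beta = {beta:.0%}: both fail on about 1 job in {round(1 / p):,}")
```

With no shared causes the result is the familiar 1 out of 10000. If only 10% of failures are shared, the pair is already about ten times less reliable than that.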

In the Space Shuttle, there were 5 computers running in parallel, with one of them running software that had been written independently. This was the safeguard against a bug affecting all the computers.

Building two truly independent systems is therefore harder than it seems.
You only have a redundant system when both prime and backup are equally reliable and truly independent.

But there’s a last problem…

Redundancy Pitfall #3: Supervision

Redundancy is its own enemy.
We are humans, and when we know that something is almost 100% safe, we tend to take success for granted and to stop worrying.

Lack of supervision is what can ultimately put your redundant system at risk.

If you watch your two recorders, you will immediately notice when one of them stops, and you can take action.
At the other extreme, if you never check the results of your backup recorder (because the prime recorder works), you can overlook a systematic failure in your backup procedure, and you are no longer protected. That’s the “broken backup” syndrome.
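
Supervision can be partly automated. The Python sketch below assumes that each recorder writes to a file that a monitoring computer can see, which is not true of every standalone recorder, and it raises a warning when a recording stops growing. All function names are hypothetical.

```python
import os
import time


def find_stalled(previous: dict, current: dict) -> list:
    """Return the recordings whose size did not grow between two checks."""
    return [path for path, size in current.items() if size <= previous.get(path, -1)]


def read_sizes(paths: list) -> dict:
    """Current size of each recording file; a missing file counts as 0 bytes."""
    return {p: os.path.getsize(p) if os.path.exists(p) else 0 for p in paths}


def watch(paths: list, interval_s: float = 10.0) -> None:
    """Poll the recordings until interrupted and report any file that stops growing."""
    previous = read_sizes(paths)
    while True:
        time.sleep(interval_s)
        current = read_sizes(paths)
        for path in find_stalled(previous, current):
            print(f"WARNING: {path} has stopped growing, check that recorder now")
        previous = current
```

You would call `watch` with the paths of the two recordings, one per recorder. A warning on either file deserves the same attention, because a stalled “backup” means you are no longer redundant.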

RAID 5 is a redundant storage system in which several hard disks (4 in this example) work together. If any one of the 4 disks fails, it can be replaced with a new disk and, without stopping the system, the remaining 3 healthy disks “rebuild” the content of the failed disk. Redundancy is restored once the rebuild completes.
In theory, RAID 5 systems are almost 100% reliable, because the data is only vulnerable between the failure of a disk and the end of the rebuild. The rest of the time, the system is redundant.

However, the #1 failure mode of RAID 5 systems is lack of supervision: nobody notices that one of the disks has failed, and the RAID 5 continues to work in degraded mode for weeks or months. Then one day a second disk starts having errors, and it’s too late to rebuild: with two failed disks, the system has become unrecoverable.

Therefore, please add this to your list of New Year’s resolutions:

I will verify my redundant recording procedure and I will:

1. Label my recorders “Even” and “Odd” and, based on the day of the month, use one or the other as “Prime”.
2. Verify that my two systems are truly independent, so that no single cause can produce a failure in both systems.
3. Watch my recorders and investigate any issue detected.

Happy New Year 2016!