Backups – Junior Developer Handbook

This is the absolute minimum you need to know about backups. Having a solid backup and restore strategy is crucial because you will be hit by something at some point: hackers, trolls, ransomware, hardware that went rogue or maybe a mistake you made yourself. As usual, you have to decide how far you want to go with the whole thing, weighing cost against the risk of disaster. Having at least something is the minimal requirement, going in raw is insane.

Uh oh, where’d the backup go?

Let’s assume you have a backup that works well. In a nightly job, you pull all the data that changed during the day and store it somewhere else. You get into work on a sunny spring morning and find everyone is out of their minds because the data is all gone. You stay calm, go to your computer and think about all the blessings you’ll get once you hit the magic button and restore everything. But there is nothing to restore. The colleague who caused the issue has rm -rf’ed everything, including the backups. Uh oh.

In another scenario, the data isn’t gone but encrypted by ransomware X that infected your company. The backup is stored on a network file share and also fell victim to the ransomware. Uh oh.

You need to protect your backup from being accidentally deleted or overwritten. Make it read-only once it is completed.
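Here is a minimal sketch of the "make it read-only" step, assuming the finished backup is a plain directory on a POSIX-style file system. The function name make_read_only is made up for illustration. Keep in mind that permission bits mostly protect against accidents: the owner or root can flip them back, so against a determined attacker you want immutable or write-once storage on top.

```python
import os
import stat
from pathlib import Path

# All three write bits: owner, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(backup_dir: Path) -> None:
    """Clear the write bits on everything below backup_dir (hypothetical helper)."""
    # Walk bottom-up so the contents are handled before their parent directory.
    for root, dirs, files in os.walk(backup_dir, topdown=False):
        for name in files + dirs:
            path = Path(root) / name
            if path.is_symlink():
                continue  # chmod would follow the link and touch its target
            path.chmod(path.stat().st_mode & ~WRITE_BITS)
    # Finally lock the top-level directory itself.
    backup_dir.chmod(backup_dir.stat().st_mode & ~WRITE_BITS)
```

You would call this as the very last step of the backup job, after the data has been written and checked.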

The storage server died and with it the backup

Your backup should be available to you at all times, no matter what happens. Storing the backup on the server that would need the backup in case of failure is a bad idea. Server dead. Backup dead. Also, think about virtual machines and storage servers. If the VM host dies and you have both the backup VM and the production system VM running on it, you’re out of luck. If you store the backup on the same storage appliance (SAN, NAS, …) your system uses, you’re out of luck if the storage appliance dies. And even if you back up to a network share on another continent, you’re out of luck if that connection stays open all the time and you get hacked.

You need to store your backups on independent infrastructure. Even better on infrastructure that is off-site. Even more betterer if the infrastructure is offline when it’s not in use by a backup or restore procedure.

Why is John from HR deleting our backup schedule?

You had a security breach and now your Active Directory has a bad person in it who’s going nuts on everything he can access. The intruder also managed to pull off a privilege escalation, so he can and will go nuts on the backups as well.

Your backup system needs identity and access management (IAM) that is completely disconnected from everything else. I know this is not a Microsoft best practice but please, don’t join backup-related things to your Active Directory. Use SSH public key authentication to get to the backup infrastructure or, even better, plug a monitor and keyboard into the physical machine if you need to administer it.

The backup also has bad data in it

Every night a new backup is written and the one from ten days ago is deleted. On day eleven you notice there is something not quite right. You do some research and find that the bad thing happened eleven days ago. The oldest backup you have is ten days old. You have no backup.

Your backup needs to be versioned and incremental so you can restore to any point in the past within reason, like a year or something depending on your use case.
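One common way to get a long history without keeping every single nightly backup forever is a tiered retention scheme. The sketch below is only an illustration: the function name and the numbers (14 dailies, 8 weeklies, 12 monthlies) are assumptions, not a recommendation for your use case.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshots, today, daily=14, weekly=8, monthly=12):
    """Return the set of snapshot dates a tiered retention policy would keep.

    Keeps every snapshot younger than `daily` days, plus the newest snapshot
    of each of the last `weekly` ISO weeks and `monthly` calendar months
    that have a snapshot at all.
    """
    newest_first = sorted(set(snapshots), reverse=True)
    keep = {day for day in newest_first if (today - day).days < daily}

    weeks, months = {}, {}
    for day in newest_first:
        week_key = tuple(day.isocalendar())[:2]  # (ISO year, ISO week)
        month_key = (day.year, day.month)
        # Iterating newest first means the first hit per week or month
        # is the newest snapshot in that bucket.
        if week_key not in weeks and len(weeks) < weekly:
            weeks[week_key] = day
        if month_key not in months and len(months) < monthly:
            months[month_key] = day

    return keep | set(weeks.values()) | set(months.values())
```

Anything not in the returned set is a candidate for deletion. With nightly backups this keeps a few dozen snapshots while still reaching back about a year.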

Restore worked fine five years ago

Your backup survived. Your backup has nice and correct data in it. The magic button makes the progress bar for the restore go from zero to ten, to twenty, to fifty and then it dies. Nobody knows why it does not work now. If only we had tried it last week or something, we would’ve had time to investigate without being in a hurry and the boss breathing down our necks constantly.

You need to test your backup and restore process regularly. Only then can you be somewhat confident it will work when you need it to work urgently.
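A restore test is only worth something if you check what came out of it. As a rough sketch, assuming you can restore into a scratch directory and still have a reference copy of the original files to compare against, you could hash both trees and list everything that differs. The function names here are made up for illustration.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            sha = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large files do not fill the memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    sha.update(chunk)
            digests[path.relative_to(root).as_posix()] = sha.hexdigest()
    return digests


def restore_differences(original: Path, restored: Path) -> list:
    """List files that are missing, extra or different after a test restore."""
    before, after = tree_digests(original), tree_digests(restored)
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))
```

An empty list means the restored tree matches the reference byte for byte. Anything else is something to investigate now, while nobody is in a hurry.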

Hell yeah we do backups, I guess?!

The backup is needed urgently. Upon opening up your backup software you notice a lot of red color coming out of your display. The backup procedure failed. It failed a hundred and fifty-two times in a row. No one noticed. You don’t have a recent backup.

The backup is, again, needed urgently. You start the restore process. Your backup software does an integrity check before actually starting the restore process. You notice, again, a lot of red color coming out of your display. The integrity check failed. You cannot restore the backup.

You need to constantly monitor your backups for 1) successful completion and 2) data integrity.
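What such a check could look like, as a hedged sketch: assume your backup tool lets you export its run history as (finished_at, completed, integrity_ok) tuples. That record format, the function name and the 26-hour threshold are all assumptions for illustration. The point is that somebody (or something) gets alerted on a failure streak, and also when no good backup has shown up for too long, which catches the job that silently stopped running.

```python
from datetime import datetime, timedelta


def backup_alerts(runs, now, max_age=timedelta(hours=26)):
    """Return a list of alert messages for a backup job's run history.

    runs: iterable of (finished_at, completed, integrity_ok) tuples.
    """
    alerts = []
    ordered = sorted(runs, key=lambda run: run[0])

    # Count how many of the most recent runs went wrong in a row.
    streak = 0
    for _, completed, integrity_ok in reversed(ordered):
        if completed and integrity_ok:
            break
        streak += 1
    if streak:
        alerts.append(f"{streak} bad run(s) in a row")

    # A run only counts as good if it completed AND passed the integrity check.
    good = [finished for finished, completed, integrity_ok in ordered
            if completed and integrity_ok]
    if not good:
        alerts.append("no verified backup on record")
    elif now - good[-1] > max_age:
        alerts.append(f"newest verified backup is from {good[-1]:%Y-%m-%d %H:%M}")
    return alerts
```

Run something like this from a machine other than the backup server and send the result wherever people actually look, not to a dashboard nobody opens.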

We’ll be up and running again in … 2 months

The boss comes in and asks if the backup is restoring okay. You see him breathe a big sigh of relief when the team tells him: Yes, the process is running as expected. He congratulates you on your hard work and effort to get the company running again and then he asks in passing when it will be done. His relaxed demeanour freezes up immediately when the team answers that it’ll take about two months until it’s done. The company goes bankrupt a month later as it has no revenue but many people on the payroll.

A lot of companies have vast amounts of data but need only a fraction of it to keep operating. Historical records are not as important as currently open orders. The production schedules for the coming month are vastly more important than the past schedules. All the data in the employee parking spot assignment system is nonessential.

You need a solid priority concept for restores. Of course, you need to test and verify that regularly as well.
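To make the priority idea concrete, here is a small sketch that orders data sets by priority and estimates when each one would be back, assuming a constant restore throughput. The class, the field names and all numbers are hypothetical. Real restores are rarely this linear, so treat the result as a rough planning aid that you still verify with actual restore tests.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    priority: int    # 1 = needed to operate, higher numbers can wait
    size_gb: float


def restore_plan(datasets, throughput_gb_per_hour):
    """Return (name, hours until restored) pairs in restore order."""
    plan = []
    elapsed_hours = 0.0
    # Most critical first; within one priority, small data sets go first
    # so that something usable is back as early as possible.
    for ds in sorted(datasets, key=lambda d: (d.priority, d.size_gb)):
        elapsed_hours += ds.size_gb / throughput_gb_per_hour
        plan.append((ds.name, round(elapsed_hours, 1)))
    return plan


# Made-up example data, restored at an assumed 100 GB per hour.
plan = restore_plan(
    [
        Dataset("historical records", 3, 140_000),
        Dataset("open orders", 1, 200),
        Dataset("parking spot assignments", 4, 1),
        Dataset("production schedules", 1, 50),
    ],
    throughput_gb_per_hour=100,
)
for name, hours in plan:
    print(f"{name}: restored after about {hours} h")
```

In this made-up example the data you need to operate is back within hours, while the historical records take weeks. That is the difference between an annoying week and a bankrupt company.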

Summary – TL;DR

Your backups must be

  • read-only once completed,
  • stored on infrastructure that is both logically and physically detached from the rest of the system,
  • isolated in terms of identity and access management,
  • versioned (incremental updates),
  • verified end-to-end (regularly make sure backup and restore work as expected),
  • monitored for failures and integrity and
  • able to restore the most critical data first and fast.

Photo by Marc Szeglat on Unsplash