Wednesday, February 11, 2009

How to REALLY back up like a pro

Over at Jay Lake's blog, he has written an entry on how to do backups like a pro. I cannot whitewash this; this is sheer poppycock. What Jay describes is absolutely the safest way to back up your files... circa 1990. The IT industry solved this problem years ago. It's over, done. And the software you need to do so is in the "cheap," "free," or "already paying for it" category.

That's an audacious claim, so let me start by debunking the notion that this baroque sequence of events is actually safe. There's one fundamental problem with it, and that is it relies on humans to not make errors. There are a lot of steps in there. And I don't know about other people, but after I've just spent a couple hours pounding out a few thousand words, I'm not at my most mentally keen. Should you really be willing to bank your security on habitually executing all those steps correctly under those circumstances?

In a word, no. Going through these motions is tantamount to thinking that the TSA guys who make us take off our shoes, and restrict liquids to no more than 3.4 oz (so we can't make a very BIG bomb?) are actually protecting us from terrorists. They counter what Bruce Schneier calls "movie terrorist plots"--threats that seem large, but in fact are not very likely--while against the real issues, they protect us little, or not at all.

An example, you say? Well, have any of you ever done any of these?
  • Saved a backup file with the wrong filename, so you can't find it later, or you accidentally overwrote a version you wanted to keep?
  • Had your email account hacked? For example, by a spammer, who gets your Gmail account permanently shut down within a matter of hours?
  • Sent your backup DVDs to the wrong relative, or had the right relative not correctly file them, making them impossible to find should you need them?
Et cetera, et cetera. "Impossible!" you say, "because I really CARE about my data." Consider this article from 2007, which cites a researcher who discovered that human error is the most common cause of security breaches.

"So, Mr. Smarty," you say, "You are not really being part of the solution here." Fair enough. Safely backing up your stuff requires two systems to cooperate:
  • Revision control (also called version control or source control), and
  • Off-site backups.
A revision control system (RCS, from here on) is software that was designed for tracking changes to program source code, but it will happily track any file you give it. No reasonable development shop works without one these days. It works thus: when you make changes to a file, you push those changes over to a revision control server (this is called "committing" the file), which remembers ABSOLUTELY EVERYTHING YOU HAVE EVER DONE TO THAT FILE. In the blink of an eye, you can revert to an older version, without destroying the data you've subsequently stored in your RCS.
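
To make the "remembers everything" idea concrete, here is a minimal sketch in Python. It is a toy, not how Subversion or any real RCS is implemented: the names `Revision`, `RevisionStore`, `commit`, `checkout`, and `revert_to` are all made up for illustration, and it tracks a single file in memory. The point it shows is that reverting adds a new revision on top of the old ones, so nothing is ever destroyed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Revision:
    number: int    # 1-based revision number, assigned by the store
    comment: str   # the note typed at commit time
    content: str   # full text of the file at this revision


@dataclass
class RevisionStore:
    """Toy, in-memory stand-in for a revision control server (one file only)."""
    history: List[Revision] = field(default_factory=list)

    def commit(self, content: str, comment: str = "") -> int:
        """Record a new revision; earlier revisions are never modified."""
        number = len(self.history) + 1
        self.history.append(Revision(number, comment, content))
        return number

    def checkout(self, number: int) -> str:
        """Return the file exactly as it was at the given revision."""
        if not 1 <= number <= len(self.history):
            raise ValueError(f"No such revision: {number}")
        return self.history[number - 1].content

    def revert_to(self, number: int) -> int:
        """Restore an old version by committing it again as the newest revision."""
        return self.commit(self.checkout(number), f"Reverted to revision {number}")


store = RevisionStore()
store.commit("Jack hit the road.", "First draft")
store.commit("Jill hit the road.", "Changed protag from a man to a woman")
store.revert_to(1)  # revision 3 now matches revision 1; revision 2 is still there
```

After the revert, `store.checkout(2)` still returns the version you backed away from, which is exactly what a pile of overwritten files cannot give you.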

Couple this with off-site backups. The best way to achieve this is to run your RCS server on your internet hosting account. When you save your changes, you tell the RCS to push the changes to your repository of files at your ISP. (Yes, there are some risks in doing this. There is no such thing as "no risk," only "manageable risk," and it's wise under these circumstances to get someone who knows about such things to advise you when first setting up your RCS, to mitigate this risk.) At any point in the future, you can restore every single file you've ever stored there, to any version you've ever committed.

Meanwhile, there are guys who work for your ISP who get paid to do nothing but think about how to keep data from being lost. They use RAID arrays, which protect systems from drive loss. They do regular tape backups. Some of THEM do multisite backups, automatically mirroring your data to another node in their network to protect against catastrophic failure.

If you can't see how that's better than gmailing yourself all your files, I have failed at this argument.


Other benefits
Not only is this a solid backup strategy that requires minimal manual intervention, there are several side benefits it gives you for free.
  • If you are working on a project collaboratively, how does your collaborator know they have the most current revision of the file? With revision control, you push changes up to your revision server, give your partner access and let them pull the most current changes using the same software. Unlike email, this works in real time. You can even lock your files to show that you are working on them, so neither of you stomps on the other's changes.
  • Every checkin to revision control allows you to add a comment. So when you are looking for a particular past revision of a file, you can read "Road Trip: Changed protag from a man to a woman" instead of digging through a bunch of files called "RoadTrip_v132.doc," "RoadTrip_v133.doc" and so on. Additionally, the system automatically tracks commit times and revision numbers, so even if you don't add comments, it's no harder than looking through a pile of hand-versioned files.
  • If you work on multiple machines, like I do, it's a snap to keep them in sync: on your desktop, push changes up to the server, on your laptop, pull them down.
  • If you must have multiple backup sites, revision control makes it painless to keep them in sync, as well. You can even configure one revision control server to automatically push changes over to another (this takes a little black magic, but there are plenty of people who will gladly help you set it up for not very much money or for free--including me).
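
The push-and-pull routine from the list above can be sketched as a toy Python model. Everything here is hypothetical and heavily simplified: `Server` and `WorkingCopy` are invented names, a real RCS merges changes rather than simply replacing your text on a pull, and nothing touches the network. It only illustrates why two machines (or two collaborators) stay in sync and why a stale copy cannot stomp on newer work.

```python
from typing import List, Tuple


class Server:
    """Toy central repository: an append-only list of (comment, text) commits."""

    def __init__(self) -> None:
        self.commits: List[Tuple[str, str]] = []


class WorkingCopy:
    """Toy copy of the file on one machine (desktop, laptop, a collaborator's PC)."""

    def __init__(self, server: Server) -> None:
        self.server = server
        self.text = ""
        self.synced_to = 0  # how many server commits this machine has seen

    def pull(self) -> None:
        """Bring this machine up to date with the newest commit on the server.

        Simplification: this replaces local text; a real RCS would merge.
        """
        if self.server.commits:
            self.text = self.server.commits[-1][1]
        self.synced_to = len(self.server.commits)

    def push(self, comment: str) -> None:
        """Send local text to the server; refuse if this copy is out of date."""
        if self.synced_to != len(self.server.commits):
            raise RuntimeError("Out of date: pull before you push")
        self.server.commits.append((comment, self.text))
        self.synced_to = len(self.server.commits)


server = Server()
desktop = WorkingCopy(server)
laptop = WorkingCopy(server)

desktop.text = "Chapter 1"
desktop.push("Started chapter 1")

laptop.pull()                      # laptop now has chapter 1
laptop.text += "\nChapter 2"
laptop.push("Added chapter 2")

desktop.pull()                     # desktop catches up with the laptop's work
log = [comment for comment, _ in server.commits]
```

The `log` list plays the role of the commit comments described above: a readable history instead of a folder full of numbered filenames.
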
All right, enough of my ranting. If I have even cracked your resolve on this, I encourage you not to take my word for it, but to do more research. Talk to your programmer friends. Google for some of the terms I've thrown around in this post. Go look at the web sites of some of the systems I'm talking about; the one I personally recommend for people getting started with RCS is Subversion. It is free, it's widely adopted throughout the open-source community (lots of people to answer your questions), there are a number of easy-to-use clients for it (such as TortoiseSVN), and it's pretty easy to set up. (In fact, some ISPs that cater to developers, such as Joyent, the one I use, actually have a control panel that will greatly simplify the process.) I used to use Subversion myself; if you're feeling ambitious, you might have a look at Bazaar, which is the RCS I use nowadays. (Word of caution: it's a more complex piece of software, so if you find it daunting, don't let that sour you on the whole RCS strategy.)
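
If you want a taste of what a Subversion session looks like under the hood, here is a rough Python sketch that strings together the standard `svn update`, `svn status`, and `svn commit -m` commands. The function names and the working-copy path are hypothetical, and it assumes Subversion is installed and that the folder has already been checked out from your repository. A graphical client such as TortoiseSVN does the same things with menu clicks, so treat this as an illustration rather than a recipe.

```python
import subprocess
from typing import List


def svn_commands(comment: str) -> List[List[str]]:
    """Build the Subversion commands for one writing session, in order."""
    return [
        ["svn", "update"],                 # pull down changes made elsewhere
        ["svn", "status"],                 # list what changed locally
        ["svn", "commit", "-m", comment],  # push changes up with a comment
    ]


def run_session(working_copy: str, comment: str) -> None:
    """Run the commands inside an existing working copy (requires svn installed).

    `working_copy` is a hypothetical path to a folder already under
    Subversion control; check=True stops at the first command that fails.
    """
    for command in svn_commands(comment):
        subprocess.run(command, cwd=working_copy, check=True)
```

A call such as `run_session("/path/to/novel", "Changed protag from a man to a woman")` would then update, show what changed, and commit in one go; the path is a placeholder, not a real location.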

Lastly, if you're interested in hearing more about this, please comment. I will be happy to reply privately or answer people's questions here.

TimK
Saving the world from arcane backup strategies, one writer at a time.
