Friday, 16 March 2012

Dude, where's my backup?

I am, as you might expect, stringent about backups. Really. But today, I discovered that my main server hasn't had as many backups as I thought for quite a while.

Here's how it happened.

I have the main server called xanth, and then there's a copy of that called xanti, which I can switch to quickly if xanth goes down. And another copy of that called xantj. And (don't ask) another one called query, which is a copy of xanti. And, by the way, xanth uses mirrored drives, although I put very little faith in mirroring; a screw-up on one side is instantly mirrored into a screw-up on the other.

And then there's the backup server, called foggy, which has three areas called back1, back2 and back3.

There's a daily backup to xanti, xantj, query and one of the backup areas; from the 1st of the month to the 10th, that goes to back1. From the 11th to the 20th, it goes to back2, and from the 21st to the 31st it goes to back3. The reason for this elaborate dance is, if an important subdirectory gets accidentally wiped, then the next day, it's also wiped on the backup! But I'll still have it on one of the other backup areas.
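
The rotation rule is simple enough to sketch in a few lines of Python. This is only an illustration of the day-of-month logic described above, not the actual backup script; the function name `backup_area` is invented for the example.

```python
from datetime import date


def backup_area(day: date) -> str:
    """Pick the backup area for a given date (hypothetical helper).

    Days 1-10 go to back1, days 11-20 to back2, days 21-31 to back3.
    """
    if day.day <= 10:
        return "back1"
    if day.day <= 20:
        return "back2"
    return "back3"


# The date of this post falls in the middle third of the month.
print(backup_area(date(2012, 3, 16)))  # back2
```

The point of keying on the calendar date is that each area goes untouched for at least ten days at a stretch, so a deletion that has already propagated to today's area can still be recovered from one of the other two.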

So, how did this all go wrong?

Well, the first thing I did was disable the back1-back2-back3 dance, temporarily, for a good reason that I no longer remember ... and I forgot to re-enable it.

Query let me down because there was some problem with the data volume, and Linux decided to make it read-only ... which meant no backups to it.

And then xanti crashed. I couldn't persuade the ethernet to work; I have no idea why, but I think it was a hardware problem. So I swapped out the motherboard. The new motherboard was different to the old one, so the ethernet driver now didn't work. And I don't know what happened to Kudzu (the thing that lets you probe your hardware and auto-install the drivers); I expect it has a new name now, and I couldn't see what it was. So I decided to reinstall Linux (while keeping the data, of course). So I did that, updating Fedora 9 to Fedora 9; predictably, it decided that it didn't need to do anything. So I upgraded it to Fedora 10, and it did lots. And when I came to fsck the RAID with all the data, there were a zillion errors, some of which boiled down to "the drive is a different size". So I thought, fsck this, and I reinitialised the drive and reloaded it from xanth. Which went well.

And it's just as well it did, because for a while there, my main server, xanth, had only one backup, xantj, and I feel a bit nervous in that situation. Although not as nervous as I would feel if I had no backup at all.

So, to summarise, instead of a main server and six backups, I had a main server and one backup. You can see why I've been almost kicking myself. It's what we call "a fault in the liveware".

And you can also see why I have so many backups.
