Saturday, 14 May 2016

Friday 13th

Another unlucky day.

One of my servers was acting a bit strange - lots of zombies, which are processes that have already died but still sit in the process table because their parent hasn't collected their exit status. If you want to know more about zombies, google "zombie unix". I don't know of any good way of getting rid of zombies (although if you do nothing, they sometimes go away on their own, once the parent gets around to reaping them or exits); even shooting them in the head with kill -9 doesn't work, because they're already dead. The only way I know is to reboot the computer. So that's what I did.

Well, actually, you can kill a zombie by killing its parent (which seems drastic and unfair), but when the problem is not just a couple of zombies but a fully-fledged Zombie Apocalypse, that's a feeble response. Reboot.
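For the curious, here's a minimal sketch of what a zombie is and why kill -9 bounces off it. It isn't what was running on my server; the names zombie_demo and pid_exists are made up for illustration, and it assumes a Linux-like system with Python 3, where os.fork and os.waitid are available. The child exits straight away, the parent deliberately doesn't collect it, and only a wait from the parent makes the process-table entry go away.

```python
import os
import signal


def pid_exists(pid):
    """Return True if the kernel still has a process-table entry for pid."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks existence
    except ProcessLookupError:
        return False
    return True


def zombie_demo():
    """Create a zombie, try to kill it, then reap it (hypothetical demo)."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: die immediately with exit status 7

    # Parent: block until the child has exited, but leave it unreaped
    # (WNOWAIT). From this point on the child is a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    exists_as_zombie = pid_exists(pid)

    # kill -9: the call is accepted, but the process is already dead,
    # so there is nothing left for the signal to do.
    os.kill(pid, signal.SIGKILL)
    exists_after_kill = pid_exists(pid)

    # Reaping: the parent collects the exit status, and only now does
    # the process-table entry disappear.
    _, status = os.waitpid(pid, 0)
    exists_after_reap = pid_exists(pid)

    return (exists_as_zombie, exists_after_kill, exists_after_reap,
            os.WEXITSTATUS(status))


if __name__ == "__main__":
    print(zombie_demo())
```

Killing the parent works for the same reason: the orphaned zombie gets adopted by init (or whatever the system uses in its place), which does the reaping that the original parent never did.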

Power off, power on ... and nothing happened. So I got a hands-and-eyes, and she pressed the on-button (which shouldn't be necessary, because I have the servers all set to not need that). And still nothing happened. So I got her to plug in a keyboard and monitor, and the server was asking for someone to "press F1 to continue". That usually means that the little lithium battery on the motherboard has run out of power and the server has forgotten its configuration, which is why it didn't power up.

So she pressed F1, and the boot process went as normal, but I still couldn't ping the server. And when she tried to log in, the server died. So this server has some major hardware problem that I'm not going to be able to fix remotely; it will have to wait until I visit the colocation, which tends to be once per year.

We tried it a couple more times, but it was clear that this server was now pushing up daisies, so I thanked her, and powered that server off.

Of course, I can't leave it like that; this is a customer-facing server. So I powered up the backup server, which was last updated about a month ago, and I refreshed the data on it from the daily backup, which means I might have lost a few hours' worth of stuff, but them's the breaks. And I used my firewall to redirect accesses from the old server to the new one. I love my firewall.

I also had to do a certain amount of tidying up, because the configuration files weren't totally up to date, but that wasn't too bad. And then I checked that everything was working (which led to a few more tweaks), and then it was OK, and it was half past midnight, and I'm still trying to catch up on sleep from the Night of the Great Toothache.

Only one customer noticed that there was a problem, and that was because he went to the server after I'd brought it up but while I was still updating it, so I explained to him about that server being a month out of date but not to worry because I was restoring a backup from yesterday right now, and he was happy.

So now, with the lessons learned, I'm bringing two more backup servers completely up to date, because I always like to have at least two backups.
