Saturday 11 February 2012

Meltdown!

At about 11:30 pm last night, I had a meltdown. The immediate cause was that one of my UPSes decided to suddenly conk out. That brought down all the servers that it was feeding. And for some reason, the lack of access brought down the workstation that I mostly use.

So I went down to the garage, and immediately saw that a bank of servers was off, plus the Ethernet switch. I messed around with the UPS for a few minutes, but soon decided to abandon it and look at it later. I brought in another UPS, a much smaller one (I usually have a couple of spares), and started to power that up. Then I thought, maybe it's not a good idea to put all the servers onto the smaller UPS. So I rearranged the power cables, and put the servers that I'm loading up for taking to Cheltenham on power without a UPS (it matters less if they get a power glitch, and it's even a plus, because it'll test their resilience to power-off and power-on).

All the servers powered up just fine, I'm glad to say. But on one of them ... I'd combined the two 2 TB drives together into a 4 TB RAID, which you can do under Fedora 16 (you couldn't do that until recently), and I'd called the resulting monster /dev/md0. But when I tried to revive /dev/md0, it wouldn't come up. It kept telling me "already in use". I tried googling, but that didn't help, and I tried various mdadm commands instead of --assemble, including --build and --create, but nothing worked. Eventually, I decided that this version of mdadm, or maybe just using 4 TB, is too flaky, and that I'd do this using two separate 2 TB drives. So I tried to make a file system on one of the drives. And it told me "already in use".

Feh.

By this time, it was about 1 am, and I was getting tired, but I pressed on. I had a look in /dev, to see if there was a /dev/md0 (there wasn't) and I noticed that there was a /dev/md127. Hmmm. I wonder what that is? So I fscked it, and mounted it, and it was the RAID that I'd created and called /dev/md0. I have no idea why Fedora decided to change the name.
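In hindsight, "already in use" is what you'd expect if the kernel had already assembled the two drives into an array under a different name. On Linux, /proc/mdstat lists the active md arrays and their member devices, so a quicker route than poking around in /dev would have been to read that file. Here is a minimal Python sketch of the idea. The sample text is made up to resemble this setup (the device names sdb1 and sdc1, the RAID level and the block count are assumptions, not copied from the server), and parse_mdstat and owner_of are illustrative helper names, not part of mdadm.

```python
import re

# Made-up sample in the usual /proc/mdstat layout; the device names,
# RAID level and block count are assumptions for illustration only.
SAMPLE = """\
Personalities : [raid0]
md127 : active raid0 sdb1[0] sdc1[1]
      3907023872 blocks super 1.2 512k chunks

unused devices: <none>
"""

def parse_mdstat(text):
    """Map each md array name to the list of member devices it holds."""
    arrays = {}
    for line in text.splitlines():
        # Array lines look like: "md127 : active raid0 sdb1[0] sdc1[1]"
        m = re.match(r"^(md\w+)\s*:\s*(.*)$", line)
        if not m:
            continue
        name, rest = m.groups()
        # Members are tokens such as "sdb1[0]"; keep the part before "[n]"
        arrays[name] = re.findall(r"(\S+?)\[\d+\]", rest)
    return arrays

def owner_of(device, arrays):
    """Return the array that has claimed `device`, or None if it is free."""
    short = device.split("/")[-1]  # accept "/dev/sdb1" or "sdb1"
    for name, members in arrays.items():
        if short in members:
            return name
    return None

if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as f:
            live = f.read()
    except OSError:
        live = SAMPLE  # not on Linux, or the md driver is not loaded
    for name, members in parse_mdstat(live).items():
        print(f"/dev/{name}: {', '.join(members)}")
```

If owner_of reports that a drive already belongs to some array, that array is the one to fsck and mount, whatever name it has been given.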

Also, when I tried to reboot my workstation, which had just totally frozen, the reboot got stuck as it was loading up the haldaemon. The answer to that turned out to be patience: after a while, the boot completed. Again, I don't know why.

On the UPS issue: I'm fed up with UPSes that are too heavy to handle, 50 kg or so. In future, I'm going for lighter ones, 25 kg or so, even though that means they have less stored energy. The reason is, I get two kinds of power cut. One is a quick blink, which would reboot the computers, but which any UPS can handle. The other is a multi-hour blackout caused by a major grid problem, which no UPS would handle.
