Thursday 11 December 2014

Disk drive diversions

The problem with putting 15 drives in one computer is that it's roughly 15 times as likely to develop a bad drive. But that takes up a lot less space and electricity than 15 computers, so that's why I do it. I have a drive monitor on each computer that reports back to my central monitor system when drives are looking bad.
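The "15 times" figure is a rule of thumb that holds well when the failure rate of a single drive is small. Here is a rough sketch of the arithmetic, assuming the drives fail independently of each other and using a made-up 3% yearly failure rate per drive (an illustrative number, not something I've measured):

```python
def p_any_failure(p_single: float, n_drives: int) -> float:
    """Probability that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p_single) ** n_drives


# Assumed yearly failure rate for one drive; purely illustrative.
P_SINGLE = 0.03

one = p_any_failure(P_SINGLE, 1)
fifteen = p_any_failure(P_SINGLE, 15)

print(f"1 drive:   {one:.3f}")
print(f"15 drives: {fifteen:.3f} ({fifteen / one:.1f} times as likely)")
```

Under those assumptions the multiplier comes out a little below 15, because two drives failing in the same year only counts as one bad computer.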

Last night, my drive monitor reported that drive sde on Dovda wasn't quite right. I left it till the next day to look into it. The drive sde (part of the raid array md0) had "dropped out". I don't really know what this means; it's an expression I use when a drive stops responding. The fix is to power-cycle the computer; this resets the drive and it's OK ... until it does it again at some time in the future.

So that's what I did for Dovda. And when I ran the SMART drive check, only 14 drives responded. "Oh rats," I thought, "the power cycle didn't do the trick." But on more careful examination, it had. The drive that wasn't working was sdi. A couple more reboots didn't help, so I got Dovda onto the work bench, opened it up and booted with a monitor connected. The problem was immediately obvious: drive sdi was showing up with zero gigabytes. Not a big problem; I have two separate backups of this, but I'd rather not replace the drive unless I have to.
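A zero-size drive like this could in principle be spotted without a monitor on the work bench. The sketch below assumes a Linux box where the kernel still registers the drive and reports its size (in 512-byte sectors) in /sys/block/<name>/size. The function name is my own invention, and this is not how my drive monitor actually works:

```python
from pathlib import Path
from typing import List


def zero_capacity_drives(sys_block: str = "/sys/block") -> List[str]:
    """Return the names of sd* devices that report a size of zero sectors."""
    bad = []
    for dev in sorted(Path(sys_block).glob("sd*")):
        try:
            # The size file holds the capacity in 512-byte sectors.
            sectors = int((dev / "size").read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed entry: skip rather than guess
        if sectors == 0:
            bad.append(dev.name)
    return bad
```

Calling zero_capacity_drives() with no arguments looks at the real /sys/block; the path is a parameter so the function can be tried out against a test directory.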

This is a known problem with Seagate drives; it's called the "Seagate 0 LBA" problem. I've had it a few times. It happens, I think, after you start up the drive 256 times, or something like that. I never really tried to understand it, because it seems to only happen with 1TB Seagates, and I stopped buying those several years ago; the standard is 6TB now. Here's Seagate's explanation.

There's a fix for it. You have to connect the serial port on the hard drive to the serial port on a computer. I have a little PCB with the necessary wires to do this. You run Minicom at 38400,N,8,1 and type in a series of commands. This resets something inside the firmware, and the drive then starts working again. So I did it, and it worked, except that where previously it knew that it had 89 bad sectors and had mapped them out, now it thought there were zero bad sectors, and that sounds like trouble for me in future. Oh well, it's "good enough", so I started the computer up again.

This time, it recognised all 15 drives, hurrah! But the raid array wouldn't start up; it was telling me that drive sde was unsuitable. Huh? It was perfectly suitable before. I tried re-creating the raid drive, but that didn't work either. So I had a bit of a think, and a bit of a google, and eventually realised that there was already a raid running that called itself md127 using that drive (and two others). So this is an example of a misleading error message. Where did md127 come from? Heaven knows.
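The quick way to find out whether a drive has already been claimed by some other array is to look in /proc/mdstat. Here is a minimal sketch of that check. The sample text is made up to resemble the usual mdstat layout and is not the actual output from Dovda (in particular the raid level, the block count and the names of the two other drives are invented), and the function names are my own:

```python
import re
from typing import Dict, List

# Illustrative sample only; on a real machine read the text from /proc/mdstat.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdg[2] sdf[1] sde[0]
      1953260544 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
"""

ARRAY_LINE = re.compile(r"^(md\S+)\s*:\s*(.*)$")  # e.g. "md127 : active raid5 ..."
MEMBER = re.compile(r"([^\s\[]+)\[\d+\]")         # e.g. "sde[0]" gives "sde"


def arrays_by_name(mdstat_text: str) -> Dict[str, List[str]]:
    """Map each md array name to the list of its member devices."""
    arrays = {}
    for line in mdstat_text.splitlines():
        match = ARRAY_LINE.match(line)
        if match:
            arrays[match.group(1)] = MEMBER.findall(match.group(2))
    return arrays


def arrays_using(device: str, mdstat_text: str) -> List[str]:
    """Return the names of the arrays that list the device as a member."""
    return [name for name, members in arrays_by_name(mdstat_text).items()
            if device in members]


print(arrays_using("sde", SAMPLE_MDSTAT))
```

If the drive turns up under an array name you weren't expecting, then "unsuitable" most likely just means "already in use".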

This happens far too much for my liking. The software has realised that something is wrong, but has misdiagnosed the problem. It's difficult to write the logic for diagnosing errors, because it's very difficult to test. So I just changed all references to md0 in my startup for Dovda to md127, and it worked ... nearly.

Now I'd finally got back to the original problem. The file system on the raid was in an inconsistent state, and it needed to be fscked. So I fscked it, and then everything was OK.