Monday, 27 July 2015

Time travel

Lots of fun recently!

My central data server (xanth) uses mirrored drives, because it's a very important server. And, of course, it has backups - I can immediately think of six separate backups. I'm careful.

So a couple of days ago, I was having trouble accessing it. Access was intermittent, and even when it was working, it was very slow. I went into fault-finding mode. First, I looked at the hard drive reports, and one of the drives was reporting that it wasn't working. A power-cycle fixed that, but the slowness persisted.

After much trial and error, I noticed that if I accessed it from another server within the computer room, it was fine; it was only when I accessed it across my network that it was slow. So I changed the network switch upstairs (it runs very hot, and is about 20 years old). I changed the network switch leading to that server downstairs. I power-cycled the network switch that connects all the rooms together. No joy.

Then I remembered something. I'm doing a mammoth copy, to get several servers ready for taking to my colo. So that it wouldn't mess up my network, I had put all the servers being copied from and to on a separate segment. Except xanth! And what I was seeing was network saturation: the network was so busy with the copying that it wasn't able to do much else.

So I killed the copying processes, and put a different server on the special segment for the copy, and xanth was fine! Panic over.


Several hours later, I loaded up my logging spreadsheet, and all the logs for the last several days had vanished. How can that be? I checked, and I was looking at a seven-day-old copy of that file. Where did that come from?

Fortunately, my logging spreadsheet isn't that important; I keep it so that I can track usage trends. But then I thought, what else has gone? A thorough investigation revealed that, on xanth, every change made in the last several days had vanished.

It was like I'd gone back in time by seven days. Spooky!

So I've thought about this, because it's not something I want happening too often! And I've come to the conclusion that here's what happened.

Several days ago, the primary drive of the mirrored pair dropped out. Drives do that occasionally; they just refuse to respond, but when you power-cycle, they come back (though I wouldn't trust such a drive too much, as it's liable to go again). When that happened, the clever mirroring software fell back on using the mirrored drive, and I was none the wiser.

When I cycled the power, the main drive came back online. The clever mirroring software said "Aha, the main drive is back, I'll mirror what's on it to the backup drive". And so six days of changes went up in smoke.

The above is just a guess, but I can't see any other way for the drive to have gone back in time.
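To make that guess concrete, here's a toy model of the sequence I think happened. It's only a sketch of my hypothesis, not how any real mirroring software is implemented: the Mirror class and its method names are made up for illustration, and each "drive" is just a dictionary of file names to contents.

```python
# Toy model of the guessed failure (hypothetical, for illustration only):
# a two-drive mirror where the primary drops out, writes carry on to the
# secondary alone, and when the primary returns the software resyncs in
# the wrong direction (primary -> secondary).

class Mirror:
    def __init__(self):
        self.primary = {}        # file name -> contents
        self.secondary = {}
        self.primary_online = True

    def write(self, name, data):
        # Writes go to every drive that is currently online.
        if self.primary_online:
            self.primary[name] = data
        self.secondary[name] = data

    def read(self, name):
        # Reads prefer the primary, falling back to the secondary.
        drive = self.primary if self.primary_online else self.secondary
        return drive.get(name)

    def primary_drops_out(self):
        self.primary_online = False

    def primary_returns_naive_resync(self):
        # The assumed bad decision: trust the returning primary and copy
        # it over the secondary, discarding writes made while it was away.
        self.primary_online = True
        self.secondary = dict(self.primary)


m = Mirror()
m.write("log", "day 0")
m.primary_drops_out()
for day in range(1, 7):
    m.write("log", f"day {day}")   # lands on the secondary only
print(m.read("log"))               # day 6: everything looks fine
m.primary_returns_naive_resync()
print(m.read("log"))               # day 0: back in time
```

The point of the model is that the dangerous moment isn't the drive failing; it's the decision about which copy to trust when the drive comes back. A mirror that got that decision right would treat the drive that had dropped out as the stale one.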

I've never been a big fan of mirroring, and the reason has always been that I don't know what it does. I know in general terms, but I don't know exactly what decisions the software will make as a drive goes down or up.

So I'm building a new central data server.

And this time, I won't be mirroring.

