Tuesday 3 March 2015

Is your hard drive going to fail?

Yes, obviously. But is it going to fail in the near future? Maybe you think that's in the lap of the gods, and to some extent, it is. But there's a way of predicting future failures. It's called SMART: Self-Monitoring, Analysis and Reporting Technology.

I keep tabs on all my systems by occasionally running "smartctl" on each of my drives.
That's a Linux program, but I'm sure there are equivalents for Windows and Macs. It reports a whole bunch of things, but the one I look at is the "reallocated sectors count" (RSC).
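
If you'd rather not pick the RSC out of the smartctl output by eye, here's a rough sketch in Python of one way to do it. It assumes the usual "smartctl -A" attribute table, where the reallocated sectors count is attribute ID 5 and the raw value is the last of ten columns. The function names and the /dev/sda device are just for illustration, and you'll probably need to run it as root.

```python
import subprocess

REALLOCATED_ID = "5"  # SMART attribute 5 is Reallocated_Sector_Ct


def parse_reallocated(smartctl_output):
    """Return the raw reallocated sectors count from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0] == REALLOCATED_ID:
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_rsc(device):
    """Run smartctl on one device (assumed path, e.g. /dev/sda) and return its RSC."""
    # smartctl uses its exit status as a bit mask, so a non-zero code is not
    # necessarily a failure to read the attributes; hence check=False.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_reallocated(result.stdout)


if __name__ == "__main__":
    print(read_rsc("/dev/sda"))
```

If the drive doesn't report attribute 5 in that layout, the function returns None rather than guessing.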

When a sector goes bad, the drive cleverly marks that sector as "don't use it" and uses a sector from its special reserve area instead. Every drive I get these days is SMART-enabled and lets me read the RSC. And I keep a historical record of this.

Obviously, zero RSC is ideal, and if a new drive doesn't have zero, I'd return it under warranty. Maybe I'd accept one or two, but it's always been zero.

The drive will then go for years with zero RSC, but one day, you'll see one, or a few. This stays at that level for a while, then slowly inches up.

Eventually, it reaches a few hundred, and that's not good. You should start saving up for a replacement.

With a 1000 GB drive (at least, with the Maxtors that I use), 2047 is the maximum (Seagates have a larger RSC maximum). When the RSC gets to over 1000, that's a signal to me to replace the drive. It's still working, so it's easy to copy the data onto the replacement (I use rsync, but there are probably Windows and Mac equivalents), and dispose of the old drive. I wipe the old drive by writing zeros over the whole of it, then I put it into a cache; I always carry a drive in my bike saddlebag for when I encounter a sufficiently large and dry cache to put it in. These days, that's not common, maybe one cache in 500 or more?

Other drives have other limits. I had a 3000 GB Seagate drive that had an RSC of 31368 - as soon as I saw that, I replaced it! Another 3000 GB drive had 40736. A Seagate 1000 GB had 4012 - it went from 45 to 4012 in ten days. That's another signal; if the RSC is changing rapidly, be warned.
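
To put a number on "changing rapidly", here's a small Python sketch that works out the growth in sectors per day between the last two readings in a historical record. The dates are made up to match the ten-day jump from 45 to 4012, and the limit of 10 sectors a day is purely my illustrative assumption, not a figure from any drive maker.

```python
from datetime import date


def rsc_growth_per_day(history):
    """history: list of (date, rsc) tuples, oldest first.

    Returns the growth in sectors per day between the last two readings,
    or 0.0 if there are fewer than two readings or no time has passed.
    """
    if len(history) < 2:
        return 0.0
    (d0, r0), (d1, r1) = history[-2], history[-1]
    days = (d1 - d0).days
    if days <= 0:
        return 0.0
    return (r1 - r0) / days


def is_changing_rapidly(history, per_day_limit=10.0):
    """per_day_limit is a hypothetical threshold; pick your own."""
    return rsc_growth_per_day(history) > per_day_limit


# Hypothetical dates, ten days apart, with the readings from the Seagate 1000 GB.
history = [(date(2015, 2, 1), 45), (date(2015, 2, 11), 4012)]
print(rsc_growth_per_day(history))   # (4012 - 45) / 10 sectors per day
print(is_changing_rapidly(history))
```

That drive was gaining nearly 400 reallocated sectors a day, which any sensible limit would flag.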

A Seagate 750 GB drive had 4010 RSC, a Seagate 2000 GB had 19456. I've never found a source that would tell me the maximum number for each type of drive.

So what happens if you go on using the drive? It will run slightly more slowly, because the drive will have to do extra seeks to read those reallocated sectors. But eventually, there will come a time when the drive can't read a sector, despite many retries. And then it won't be able to read a few sectors, and then it won't be able to read many sectors, and then you have a nearly failed drive, with some unreadable files. And that will, of course, get worse and worse.

Of course, drives also fail suddenly. I had that recently; I was replacing a drive that had more RSC than I was happy with, and when I started the computer up again, the system drive failed. Totally. Not a big deal, it just meant that I put Linux on a new drive, because for all my big computers, the system drive is separate from the data drives. So you shouldn't think that monitoring your RSC is a substitute for good backups!

So that's my policy. When a drive's RSC is above 1000, then it's time to replace the drive.
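
Written down as a few lines of Python, the policy looks something like the sketch below. The 1000 cut-off is the one I use; the 200 boundary for "a few hundred" and the wording of the labels are arbitrary choices made for the example.

```python
def drive_advice(rsc):
    """Map a reallocated sectors count to a rough course of action."""
    if rsc > 1000:
        return "replace"        # over 1000: time to replace the drive
    if rsc >= 200:              # assumed boundary for "a few hundred"
        return "start saving"   # not good; budget for a replacement
    if rsc > 0:
        return "watch"          # a few reallocations; keep an eye on it
    return "healthy"            # zero is ideal


for count in (0, 3, 450, 4012):
    print(count, drive_advice(count))
```

This only looks at the count itself; a rapidly rising RSC is a warning in its own right, whatever the level.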
