Wednesday 29 August 2012

Predicting drive failure

Wouldn't it be nice if you could predict hard drive failures?

Well, maybe you can. Modern hard drives (meaning those from the last several years) have a facility called S.M.A.R.T. that lets you interrogate them to see their condition. In Linux, you query it with the smartctl command.

smartctl -a /dev/sda

For Windows, you can probably find something with Google, or go to http://en.wikipedia.org/wiki/Comparison_of_S.M.A.R.T._tools

This gives you a whole scary-looking table, but I think the important figure is Reallocated_Sector_Ct.

A drive has a number of spare sectors; when it decides that a sector has failed, or is about to fail, it pretends that sector isn't there and uses a spare one instead. Reallocated_Sector_Ct tells you the number of times that's been done.
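If you only want that one figure rather than the whole table, something like the following might do. It's a minimal sketch: realloc_count is a name I've made up, and it assumes the usual ATA attribute table printed by smartctl -A, where the attribute name is the second column and the raw value is the tenth. Other drive types may lay the table out differently.

```shell
# Print the raw Reallocated_Sector_Ct value from `smartctl -A` output on stdin.
# Assumption: the standard ATA attribute table, with the attribute name in
# column 2 and the raw value in column 10.
realloc_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Typical use (needs root and a real drive, so it is left commented out):
# smartctl -A /dev/sda | realloc_count
```

Reading from stdin rather than calling smartctl inside the function means you can try it on saved output first.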

One of my servers has a failed drive with Reallocated_Sector_Ct = 3884. So that must be too many.
Worryingly, on the same server, there are two more drives with values of 1519 and 1746. That server is about to be removed from service and replaced with another newly built server (but using old drives) where all the Reallocated_Sector_Ct values are zero.
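With several drives to watch, a rough check along these lines could flag the suspicious ones. Again this is only a sketch: flag_drives is a made-up name, the input is simple "device count" pairs that you'd have to collect yourself, and the default limit of zero is my own assumption based on the healthy drives above, not any vendor threshold.

```shell
# Read "device count" pairs on stdin and flag every drive whose
# reallocated-sector count is above a limit (default 0).
# The limit is an illustrative assumption, not a vendor threshold.
flag_drives() {
    limit="${1:-0}"
    while read -r dev count; do
        if [ "$count" -gt "$limit" ]; then
            echo "WARN $dev $count"
        else
            echo "ok $dev $count"
        fi
    done
}

# Example with the counts from this post (the device names are hypothetical):
# printf '%s\n' 'sda 3884' 'sdb 1519' 'sdc 1746' | flag_drives
```

A nonzero count isn't proof that a drive is about to die, so treat a WARN as a prompt to look closer rather than a verdict.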

As usual, Wikipedia gives an excellent explanation of this: http://en.wikipedia.org/wiki/S.M.A.R.T.

2 comments:

  1. You just need a HAL-9000...

    "Just a moment...just a moment...I've just picked up a fault in the AE-35 unit. It's going to go a hundred percent failure within 72 hours."

    Of course he could be lying.

  2. You've presumably seen this paper from Googlers (who have enough drives to do reasonable analyses!)

    http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/disk_failures.pdf

    From the abstract:
    Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures.
