smartmontools: monitor the health of your hard disk

October 12th, 2008 edited by Tincho

Article submitted by Noel David Torres Taño. Guess what? We still need you to submit good articles about software you like!

One of the packages I install by hand on every new system is smartmontools. I have some experience managing computers and networks, and in my experience pirate hackers and software bugs are not the main cause of problems in small and medium installations. Hardware is.

So you have hardware that can fail, and Murphy says that if it can fail, it will. The point is not to rule out hardware failures, which would be impossible, but to detect them early or even anticipate them.

For hard disks in particular, the tool in charge is smartctl, from the smartmontools package. IDE disks (unless they date from the age of the dinosaurs) have an integrated self-monitoring facility called SMART, which stands for “Self-Monitoring, Analysis and Reporting Technology”. Modern SCSI disks have it too, if they are SCSI 3 or newer. The disk’s own electronics contain routines that track parameters of disk health: spin-up time, number of read failures, temperature, hours of operation… These parameters are not only recorded by the disk; each also has a designated threshold, and both values and thresholds can be read by software that accesses the disk using the appropriate I/O instructions.

That software is smartctl, part of the smartmontools deb package. Of course, since it accesses the disk in a raw way, you need to be root to use these commands.

smartctl can ask the disk for its identity information:

# smartctl -i /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Fujitsu MHV series
Device Model:     FUJITSU MHV2060BH
Serial Number:    NW10T652991F
Firmware Version: 00850028
User Capacity:    60,011,642,880 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Mon May 12 02:39:31 2008 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

More interestingly, smartctl can ask the disk for its attribute values:

# smartctl -A /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   046    Pre-fail  Always       -       124253
  2 Throughput_Performance  0x0004   100   100   000    Old_age   Offline      -       18284544
  3 Spin_Up_Time            0x0003   100   100   025    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1199
  5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       8589934592000
  7 Seek_Error_Rate         0x000e   100   087   000    Old_age   Always       -       1761
  8 Seek_Time_Performance   0x0004   100   100   000    Old_age   Offline      -       0
  9 Power_On_Seconds        0x0032   079   079   000    Old_age   Always       -       10866h+57m+47s
 10 Spin_Retry_Count        0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1199
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       283
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       6953
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       45 (Lifetime Min/Max 14/58)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       62
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       459276288
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000e   100   082   000    Old_age   Always       -       22371
203 Run_Out_Cancel          0x0002   100   100   000    Old_age   Always       -       1533257648465
240 Head_Flying_Hours       0x003e   200   200   000    Old_age   Always       -       0

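As you can see, some attributes are of type “Pre-fail”. If the normalized VALUE of any of these attributes drops to or below its THRESH, the disk is predicting its own failure, possibly within hours. The RAW_VALUE column is vendor-specific and is not what gets compared with the threshold.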
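The comparison between the normalized VALUE and THRESH columns can be automated. What follows is only a minimal sketch, not something shipped with smartmontools: check_smart_attrs is a hypothetical shell function that reads smartctl -A output on standard input and prints every attribute whose VALUE is at or below a non-zero THRESH. It assumes the table layout shown above, with the flag in the third column.

```shell
#!/bin/sh
# check_smart_attrs: hypothetical helper (not part of smartmontools).
# Reads "smartctl -A" output on stdin and prints the attributes whose
# normalized VALUE has reached or fallen below a non-zero THRESH.
check_smart_attrs() {
    awk '
        # Attribute rows start with a numeric ID and carry a hex flag in column 3.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value  = $4 + 0     # normalized value, e.g. "045" becomes 45
            thresh = $6 + 0     # threshold; rows with THRESH 000 are skipped
            if (thresh > 0 && value <= thresh)
                printf "%s %s (%s): VALUE %d <= THRESH %d\n", $1, $2, $7, value, thresh
        }
    '
}
```

As root it could be used as smartctl -A /dev/sda | check_smart_attrs; no output means that no attribute has reached its threshold.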

smartctl has many more options, but the last two I will cover here are -a and -t.

smartctl -t launches a disk self-test. It takes a parameter giving the type of test (long, for example). The longest test can run for tens of minutes; it checks the electrical and mechanical performance of the disk as well as its read performance, scanning the whole surface. smartctl -a, in turn, shows all available information about the disk, including self-test results. Since a test runs for minutes or tens of minutes, we cannot watch it happen. All we get when launching a test is something like:

# smartctl -t long /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 41 minutes for test to complete.
Test will complete after Mon May 12 05:44:03 2008

Use smartctl -X to abort test.

Here we are told that the test has started and that, for the next 41 minutes, disk performance may be slightly lower. The test runs entirely in the background, or rather ‘underground’, since it is not under the kernel’s control at all: everything happens inside the disk, and all we can get is the result.
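Because the result only becomes available once the drive has finished, a script has to wait before reading it. The sketch below is one hedged way to do that: selftest_wait_minutes is a hypothetical helper (not part of smartmontools) that extracts the number of minutes from the “Please wait” line, assuming the wording shown above. The commented lines show the intended use; smartctl -l selftest prints the drive’s self-test log.

```shell
#!/bin/sh
# selftest_wait_minutes: hypothetical helper (not part of smartmontools).
# Reads "smartctl -t ..." output on stdin and prints the number of minutes
# found in the "Please wait N minutes for test to complete." line.
selftest_wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes for test to complete\..*/\1/p'
}

# Intended use (as root, on a real disk; /dev/sda is only an example):
#   minutes=$(smartctl -t long /dev/sda | selftest_wait_minutes)
#   sleep "$((minutes * 60))"
#   smartctl -l selftest /dev/sda    # print the self-test log
```

If the drive reports its estimate in a different unit or wording, the function prints nothing, so a real script should check for an empty result before sleeping.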

smartctl -a, in turn, shows a very large amount of SMART information about the disk: almost everything the disk has stored, parsed for us. It is usually better to use a more specific switch; see the man page for details.

Finally, the smartmontools package also contains a daemon, smartd, which can take care of the monitoring for you. It polls the disks periodically (typically every 30 minutes) and logs all errors and attribute value changes to the syslog. The default configuration in Debian will also mail root if any problem is detected. I will not explain it here, because I want you to read its (short and easy) documentation, but remember that in order to use it you must enable it in /etc/default/smartmontools.
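To check whether the daemon has been enabled, something like the following sketch may help. It rests on an assumption the article does not state: that the Debian defaults file enables the daemon through an uncommented start_smartd=yes line. Both that variable name and the smartd_enabled helper are illustrative, so check your own file before relying on them.

```shell
#!/bin/sh
# smartd_enabled: hypothetical helper (not part of smartmontools).
# Succeeds if the given defaults file contains an uncommented
# "start_smartd=yes" line. The variable name is an assumption about the
# Debian package; the file path defaults to the one named in the article.
smartd_enabled() {
    grep -Eq '^[[:space:]]*start_smartd=yes[[:space:]]*$' "${1:-/etc/default/smartmontools}"
}
```

For example, smartd_enabled || echo "smartd is not enabled" prints a reminder when the line is missing or still commented out.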

The smartmontools package has been available in both Debian and Ubuntu for a long time.

Posted in Debian, Ubuntu

20 Responses

  1. LMZ Says:

    Thanks for the article about monitoring hardware in Linux. Waiting for another one!

  2. Marko Kevac Says:

    As I can see, all your Pre-fail values are over the threshold. What happened to the hard drive?

  3. Vid Says:

    Ubuntu does not automatically do anything to limit Load_Cycle_Count (ID# 193), and frequent load cycles can shorten the life of laptop disks. Currently this has to be set up manually in Ubuntu (even by people who don’t know how to put the command in a script file).

    If I activate laptop-mode, it fails to function in the battery mode. That is not a nice feature. Is there another solution available to protect laptop disks from frequent head (un)parking??

  4. trollenlord Says:

    There ( https://bugs.launchpad.net/ubuntu/+source/smartmontools/+bug/16386 ) is how it should work, including screenshot. However, smartmontools is too much trash to be able to do that properly. It’s horrible quality, duct tape over duct tape, added in layers for the last years. It should be completely rewritten to be worth anything.

  5. Noah Slater Says:

    What is a pirate hacker?

  6. milton Says:

    As Marko Kevac said:
    “As I can see, all your Pre-fail values are over the threshold. What happened to the hard drive?”

    ==> How to solve the errors ? replace disk ? check tests ? ***What*** can we do ?

  7. Noel Torres "Envite" Says:

    Marko Kevac Says:
    October 12th, 2008 at 6:59 am

    As I can see, all your Pre-fail values are over the threshold. What happened to the hard drive?

    You must look at the non-raw (normalized) values. There, the disk has a problem if a value falls to or under its threshold.

    Noah Slater Says:
    October 12th, 2008 at 12:53 pm

    What is a pirate hacker?

    The kind of person with the abilities of a true hacker but without his ethics. Consider the term equivalent to “black hat” or simply to “computer pirate”.

  8. Noel Torres "Envite" Says:

    milton Says:
    October 12th, 2008 at 1:28 pm

    [...]

    ==> How to solve the errors ? replace disk ? check tests ? ***What*** can we do ?

    SMART is just a technology for monitoring and analysis. If your disc’s SMART data indicates a problem in a pre-fail attribute, the best you can do is replace the disc as soon as possible. If it is in a RAID array with spare discs, activate a spare and take the failing disc offline immediately. If, on the other hand, it is a single disc in a laptop, first check that you have an up-to-date backup, and then run to the vendor for a replacement disc.

    However, I usually replace hard discs in advance when they start to show bad values in Old_age attributes, like 10000 in Start_Stop_Count (raw value) for a desktop, or 20000-30000 hours in Power_On_Seconds (depending on disc quality).

  9. Ben Says:

    I have used disks for a *long* time, i.e., years while seeing attributes read “Pre-fail”, so I’m not sure they *always* indicate imminent drive failure.

    What usually indicates imminent drive failure are the SMART drive health errors, such as “pending unreadable sectors” and the error which usually follows when the prediction is no longer pending, “Offline uncorrectable sector” which, IIRC, means that the drive has run out of its surplus of sectors to replace sectors which have gone bad, and will probably fail soon. If your disk isn’t full, then I think you won’t start losing data right away when said sectors become uncorrectable, but I wouldn’t try to fill up the disk, and would replace it within a few days. Just my experience.

  10. Noel Torres "Envite" Says:

    Pre-fail is a type of attribute. It just exists, and thus, its existence does not imply that the disc is about to fail. Its value does.

  11. Vid Says:

    The prefail value (see below) is very high on my disk. Is that bad?

    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   114   096   006    Pre-fail  Always       -       66517888
      3 Spin_Up_Time            0x0002   099   099   000    Old_age   Always       -       0
      4 Start_Stop_Count        0x0033   099   099   020    Pre-fail  Always       -       1421
      5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       58436618
      9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2699
     10 Spin_Retry_Count        0x0013   100   100   034    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0033   099   099   020    Pre-fail  Always       -       1647
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   045   037   045    Old_age   Always   FAILING_NOW 55 (1 209 55 29)
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       930
    193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       207344
    194 Temperature_Celsius     0x0022   055   063   000    Old_age   Always       -       55 (0 23 0 0)
    195 Hardware_ECC_Recovered  0x001a   074   060   000    Old_age   Always       -       158164306
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
    202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

  12. Toscalix Says:

    Congratulations for the article, Noel

  13. Noel Torres "Envite" Says:

    Thanks, Toscalix :)

  14. Noel Torres "Envite" Says:

    Vid: none of your pre-fail attributes is bad. All pre-fail attributes are better when higher. For example, Start_Stop_Count has a very good value of 99 and it can drop almost to 20 before becoming a problem.

    But your disc has a problem: it is at its temperature limit. Airflow_Temperature_Cel has a normalized value of 045, equal to its threshold of 045, and is marked FAILING_NOW (the raw reading is 55).

  15. Vid Says:

    @Noel: I get a hard disk warning message very frequently but didn’t find any solutions online. Is it possible to keep the temperature at 45 deg?

  16. Vid Says:

    @Noel: Thanks for the reply. I put this: sudo hdparm -B 254 /dev/sda, in my /etc/rcS.d file so it runs automatically each time the machine boots, but I still get the disk health warning I mentioned earlier. Strange :(

  17. Noel Torres "Envite" Says:

    Vid: hdparm -B 254 /dev/sda means that your disk can NOT spin down and selects the least power-saving setting available. Always remember that every single kilowatt-hour you spend in your computer (or any piece of it, like a hard disc) is a kilowatt-hour that turns into heat and a kilowatt-hour your cooling system must handle. I strongly advise against that kind of aggressive performance setting except for dedicated servers in 24x7 attended rooms.

    I suggest changing the hdparm value to 127 (or simply deleting that line) AND ventilating the disc better.

  18. Lauri Says:

    I have also found the software named HDSentinel (http://www.hdsentinel.com/hdslin.php) very useful. It is a utility that interprets the SMART information of your hard disk drives and predicts an estimated lifetime.

  19. noname Says:

    @Lauri. Thanks for the tip.
    The data makes a bit more sense now.

  20. TidusBlade Says:

    Thanks for another great article :)
    This is much better than the tool I used on Windows, gave me all the info I ever needed.

    Hopefully this will help me in warning me if my hard disk is about to die.