One of the packages I manually install in every new installation is smartmontools. I have some experience managing computers and networks, and in my experience malicious hackers and software bugs are not the main cause of problems in small and medium installations. Hardware is.
Thus, you have hardware that can fail, and Murphy says that if it can fail, it will. The point is not to avoid hardware failures, which would be impossible, but to detect them early or even prevent them.
For hard disks in particular, the tool in charge is smartctl, from the smartmontools package. IDE disks (unless they date from the age of the dinosaurs) have an integrated self-testing facility called SMART, which stands for “Self-Monitoring, Analysis and Reporting Technology”. Modern SCSI disks have it too if they’re SCSI 3 or newer. Inside the disk chipset there are routines that track parameters of disk health: spin-up time, number of read failures, temperature, life elapsed… These parameters are not only recorded by the disk chipset; each one has a designated safety threshold, and both the values and the thresholds can be read by software that accesses the disk using the appropriate I/O instructions.
That software is smartctl, part of the smartmontools deb package. Of course, since it accesses the disk in a raw way, you need to be root to use it.
smartctl can ask the disk for its SMART identification:
# smartctl -i /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Fujitsu MHV series
Device Model:     FUJITSU MHV2060BH
Serial Number:    NW10T652991F
Firmware Version: 00850028
User Capacity:    60,011,642,880 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Mon May 12 02:39:31 2008 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
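The last line of that output is the one worth checking in a script. Here is a minimal sketch, assuming the "SMART support is: Enabled" wording shown above; the function name smart_enabled is made up for illustration and is not part of smartmontools.

```shell
# Hypothetical helper: reads `smartctl -i` output on stdin and succeeds
# only if the drive reports SMART support as enabled.
smart_enabled() {
    grep -q '^SMART support is:[[:space:]]*Enabled'
}

# Typical use (as root); `-s on` asks the drive to turn SMART on:
#   smartctl -i /dev/sda | smart_enabled || smartctl -s on /dev/sda
```

The check matches the "Enabled" line only, so a drive that reports SMART as merely "Available" still counts as not enabled.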
smartctl can ask the disk for its parameter values:
# smartctl -A /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   046    Pre-fail  Always       -       124253
  2 Throughput_Performance  0x0004   100   100   000    Old_age   Offline      -       18284544
  3 Spin_Up_Time            0x0003   100   100   025    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1199
  5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       8589934592000
  7 Seek_Error_Rate         0x000e   100   087   000    Old_age   Always       -       1761
  8 Seek_Time_Performance   0x0004   100   100   000    Old_age   Offline      -       0
  9 Power_On_Seconds        0x0032   079   079   000    Old_age   Always       -       10866h+57m+47s
 10 Spin_Retry_Count        0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1199
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       283
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       6953
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       45 (Lifetime Min/Max 14/58)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       62
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       459276288
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000e   100   082   000    Old_age   Always       -       22371
203 Run_Out_Cancel          0x0002   100   100   000    Old_age   Always       -       1533257648465
240 Head_Flying_Hours       0x003e   200   200   000    Old_age   Always       -       0
As you can see, some attributes are marked as “Pre-fail”. If the normalized VALUE of any of these attributes drops to or below its THRESH, the disk is about to fail in hours, maybe minutes.
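Comparing VALUE against THRESH by eye gets tedious, so the check can be scripted. What follows is a minimal sketch, assuming the column layout of the table above (ID#, name, flag, VALUE, WORST, THRESH, TYPE, ...). The function name failing_prefail is hypothetical, not something smartmontools provides.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints every
# Pre-fail attribute whose normalized VALUE is at or below its THRESH.
failing_prefail() {
    awk '$1 ~ /^[0-9]+$/ && $7 == "Pre-fail" && ($4 + 0) <= ($6 + 0) {
        # $1 = ID, $2 = attribute name, $4 = VALUE, $6 = THRESH
        printf "%s %s value=%d thresh=%d\n", $1, $2, $4, $6
    }'
}

# Typical use (as root):
#   smartctl -A /dev/sda | failing_prefail
```

On the healthy disk shown above it prints nothing, because every Pre-fail attribute sits at 100, well above its threshold.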
smartctl has more options, but the last ones I will cover here are -a and -t.
smartctl -t launches a disk test. It needs a parameter indicating the type of test; the longest one can last for tens of minutes and checks the electrical and mechanical performance as well as the read performance of the disk, going through its whole surface. smartctl -a, for its part, shows all available information about the disk, including self-test results. Since tests span minutes or tens of minutes, we cannot watch them happen. All we get when launching a test is something like:
# smartctl -t long /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 41 minutes for test to complete.
Test will complete after Mon May 12 05:44:03 2008

Use smartctl -X to abort test.
Here we’re being told that the test has started and that we may see slightly lower disk performance for the next 41 minutes. The test runs completely in the background, or better, ‘underground’, since it does not run under the kernel’s control at all: everything happens inside the disk, and all we can get is the result.
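Because the only thing left to do is wait and then read the result, the waiting can be scripted too. This is a rough sketch, assuming the “Please wait N minutes” wording shown in the output above; the function name selftest_minutes is invented for the example. The self-test log itself is read with smartctl -l selftest.

```shell
# Hypothetical helper: reads `smartctl -t ...` output on stdin and prints
# the number of minutes the drive says the test will take.
selftest_minutes() {
    sed -n 's/^.*Please wait \([0-9][0-9]*\) minutes for test to complete.*$/\1/p'
}

# Typical use (as root):
#   mins=$(smartctl -t long /dev/sda | selftest_minutes)
#   sleep "$((mins * 60))"
#   smartctl -l selftest /dev/sda   # print the self-test log
```

If the wording differs on your version the function prints nothing, so it is worth checking that the result is not empty before sleeping on it.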
smartctl -a, in turn, shows a very large amount of SMART information about the disk: almost all the stored SMART information, parsed for us. It is usually better to use a more specific switch; see the man page for details.
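One such specific switch is -H, which prints only the disk’s overall health verdict. For scripts, the man page also documents smartctl’s exit status as a bit mask. The sketch below decodes it; the function name is hypothetical, and the bit meanings are paraphrased from the return-values section of the man page, so verify them against your version.

```shell
# Hypothetical helper: explain each bit set in a smartctl exit status.
# Bit meanings are paraphrased from the smartctl man page.
explain_smartctl_status() {
    status=$1
    [ $((status & 1)) -ne 0 ]   && echo "command line did not parse"
    [ $((status & 2)) -ne 0 ]   && echo "device open failed"
    [ $((status & 4)) -ne 0 ]   && echo "a SMART command to the disk failed"
    [ $((status & 8)) -ne 0 ]   && echo "SMART status: DISK FAILING"
    [ $((status & 16)) -ne 0 ]  && echo "prefail attribute at or below threshold"
    [ $((status & 32)) -ne 0 ]  && echo "attribute was at or below threshold in the past"
    [ $((status & 64)) -ne 0 ]  && echo "error log contains errors"
    [ $((status & 128)) -ne 0 ] && echo "self-test log contains errors"
    return 0
}

# Typical use (as root):
#   smartctl -H -A /dev/sda > /dev/null
#   explain_smartctl_status $?
```

An exit status of 0 prints nothing, which is the quiet result you want from a cron job.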
Finally, I want to mention that the smartmontools package includes a daemon, smartd, that can take care of running tests for you. It works by polling the disks periodically (typically every 30 minutes), much as you would by hand with smartctl, and logging all errors and parameter value changes to the syslog. The default configuration in Debian will also mail root if any problem is detected. I will not explain it here, because I want you to read its (short and easy) documentation, but remember that in order to use it you must enable it in