Test If a Linux Server SCSI / SATA Hard Disk Is Going Bad

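One of our regular readers asked:

How do I test whether my hard disk is failing? I can only see a few errors in the /var/log/messages file.

I/O errors in /var/log/messages indicate that something is wrong with the hard disk and that it may be failing. You can check the disk for errors with the smartctl command, a control and monitoring utility for SMART disks on Linux and Unix-like operating systems.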
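As a rough first check, the log can be summarized per device. The sketch below assumes kernel messages of the form "I/O error, dev sdb, sector N"; the exact wording varies between kernel versions, and the log excerpt and the function name count_io_errors are made up for illustration.

```shell
#!/bin/sh
# count_io_errors LOGFILE
# Prints one line per device: "<device> <number of I/O error lines>".
# Assumption: kernel messages contain "I/O error, dev <name>, sector <n>".
count_io_errors() {
    grep 'I/O error, dev ' "$1" |
        sed 's|.*I/O error, dev \([^, ]*\).*|\1|' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Demonstration on a small, made-up log excerpt (not real data).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
Oct 26 21:10:01 server kernel: end_request: I/O error, dev sdb, sector 1024
Oct 26 21:10:04 server kernel: end_request: I/O error, dev sdb, sector 1032
Oct 26 21:10:09 server kernel: end_request: I/O error, dev sda, sector 2048
Oct 26 21:11:00 server sshd[311]: Accepted password for admin
EOF
count_io_errors "$sample_log"
rm -f "$sample_log"
```

On a real server you would point the function at /var/log/messages (or /var/log/syslog), usually as root. A device that shows up repeatedly is the one to examine with smartctl.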

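smartctl controls the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into many ATA-3 and later ATA, IDE and SCSI-3 hard drives. The purpose of SMART is to monitor the reliability of the hard drive, predict drive failures, and carry out different types of drive self-tests.

Using smartctl on a server

smartctl is a command-line utility designed to perform SMART tasks such as printing the SMART self-test and error logs, enabling and disabling SMART automatic testing, and starting device self-tests. First, make sure SMART support is enabled in the BIOS. Then run the following command to find out whether your hard disk supports SMART: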

# smartctl -i /dev/sdb

To enable SMART, run:


# smartctl -s on -d ata /dev/sdb

Sample output:

smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

To run the overall health self-assessment test, type:

# smartctl -d ata -H /dev/sdb

Sample output:

smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
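When this check is scripted, only the verdict line matters. The following is a minimal sketch, assuming the ATA-style result line shown in the sample output; SCSI disks report their health with a differently worded line, which this sketch does not handle. The function name health_status is an illustrative choice.

```shell
#!/bin/sh
# health_status: reads `smartctl -H` output on stdin and prints only the
# verdict that follows "SMART overall-health self-assessment test result:".
health_status() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}

# Sample text taken from the smartctl output shown in this article.
health_status <<'EOF'
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
EOF
```

On a live system the call would look like `smartctl -d ata -H /dev/sdb | health_status`. A PASSED verdict alone does not prove the disk is healthy, so the attribute table is still worth checking.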

Sample output from a problem disk (the overall result is still PASSED, but one attribute is flagged FAILING_NOW):

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   044   033   045    Old_age   Always   FAILING_NOW 56 (96 110 58 25)

The following command gives more detailed information about the failing disk:

# smartctl --attributes --log=selftest /dev/sda

Sample output:

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   098   092   006    Pre-fail  Always       -       238320363
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       587
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       9
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       51672328
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4805
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       586
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       417
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   094   094   000    Old_age   Always       -       6
190 Airflow_Temperature_Cel 0x0022   044   033   045    Old_age   Always   FAILING_NOW 56 (96 122 58 25)
194 Temperature_Celsius     0x0022   056   067   000    Old_age   Always       -       56 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   043   026   000    Old_age   Always       -       238320363
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       49
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       49
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       172082159686339
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2155546016
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3048586928

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4789         1746972641
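A table this long is easier to scan with a small filter. The sketch below prints the attributes whose WHEN_FAILED column is set, plus attributes 5, 197 and 198 when their raw value is nonzero. Treating those three IDs as warning signs is a common rule of thumb rather than something smartctl itself reports, and the function name flag_attributes is hypothetical.

```shell
#!/bin/sh
# flag_attributes: reads the smartctl attribute table on stdin.
# A data row has a numeric ID in column 1 and a hex FLAG in column 3;
# column 9 is WHEN_FAILED and column 10 is the start of RAW_VALUE.
flag_attributes() {
    awk '
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-f]+$/ && NF >= 10 {
            if ($9 != "-")
                print $2 ": " $9
            else if (($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0)
                print $2 ": raw value " $10
        }
    '
}

# A few rows copied from the sample output in this article.
flag_attributes <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       9
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4805
190 Airflow_Temperature_Cel 0x0022   044   033   045    Old_age   Always   FAILING_NOW 56 (96 122 58 25)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       49
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
EOF
```

For these rows the filter reports Reallocated_Sector_Ct, Airflow_Temperature_Cel and Current_Pending_Sector. On a live system the input would come from `smartctl --attributes /dev/sda`.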

You can get even more data by typing the following command:

# smartctl -d ata -a /dev/sdb

Output:

smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2500YS-01SHB0
Serial Number:    WD-WCANY1729333
Firmware Version: 20.06C03
User Capacity:    251,000,193,024 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Jul  4 15:04:38 2007 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (7800) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  92) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   190   187   021    Pre-fail  Always       -       5500
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6382
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   127   096   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

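A note about RAID controllers

The syntax for checking ATA disks behind a 3ware SCSI RAID controller is:

# smartctl -a -d 3ware,2 /dev/sda
# smartctl -a -d 3ware,0 /dev/twe0

For more information, see how to use the smartctl command to check disks behind Adaptec RAID and 3ware SCSI RAID controllers.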

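Task: Run an extended self-test on the hard disk

To start an extended self-test on /dev/sdb, run the command below. It can be run on a live system. The results are recorded in the self-test log, which you can view with the '-l selftest' option.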

# smartctl -d ata -t long /dev/sdb
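Once the test has finished, the self-test log shows whether it hit a bad block. The sketch below pulls the LBA of the first error out of that log. It assumes the column layout shown in the sample outputs of this article, where the last column is "-" when no error was found; the function name first_error_lbas is hypothetical.

```shell
#!/bin/sh
# first_error_lbas: reads `smartctl -l selftest` output on stdin and prints
# the LBA_of_first_error column for every logged test that recorded one.
first_error_lbas() {
    awk '/^# *[0-9]+ / && $NF != "-" { print $NF }'
}

# The failed entry is copied from this article; the passing entry is a
# made-up example of a clean test for contrast.
first_error_lbas <<'EOF'
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4789         1746972641
# 2  Short offline       Completed without error       00%      4700         -
EOF
```

On a live system the input would come from `smartctl -d ata -l selftest /dev/sdb`. No output means no logged test recorded a failing LBA.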

Sample detailed report for a failing hard disk:

# smartctl -a /dev/sda

Sample output:

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31500341AS
Serial Number:    9VS0TG4B
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Oct 26 21:16:15 2009 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 617) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   098   092   006    Pre-fail  Always       -       238338845
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       587
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       9
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       51672525
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4806
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       586
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       417
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   094   094   000    Old_age   Always       -       6
190 Airflow_Temperature_Cel 0x0022   044   033   045    Old_age   Always   FAILING_NOW 56 (96 126 58 25)
194 Temperature_Celsius     0x0022   056   067   000    Old_age   Always       -       56 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   043   026   000    Old_age   Always       -       238338845
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       49
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       49
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       107168023974595
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2155546480
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3048590512

SMART Error Log Version: 1
ATA Error Count: 416 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 416 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:55:03.917  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:55:03.818  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:55:03.798  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:55:03.779  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:55:03.658  READ NATIVE MAX ADDRESS EXT

Error 415 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:55:00.927  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:55:00.837  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:55:00.817  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:55:00.800  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:55:00.747  READ NATIVE MAX ADDRESS EXT

Error 414 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:54:57.903  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:54:57.807  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:54:57.787  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:54:57.757  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:54:57.637  READ NATIVE MAX ADDRESS EXT

Error 413 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:54:54.862  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:54:54.767  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:54:54.746  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:54:54.728  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:54:54.677  READ NATIVE MAX ADDRESS EXT

Error 412 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:54:51.838  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:54:51.736  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:54:51.716  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:54:51.685  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:54:51.566  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4789         1746972641

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

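Restore from backup

If any of the tests report errors, replace the hard disk and restore the data from backup.

Install smartd on the server to receive email alerts when problems are found

smartd is a daemon that monitors hard disks and attempts to enable SMART monitoring on ATA devices. It polls ATA and SCSI devices every 30 minutes (configurable) and logs SMART errors and attribute changes through the SYSLOG interface. Where these notifications and warnings end up is system dependent (typically /var/log/messages or /var/log/syslog). Besides logging to a file, smartd can also be configured to send email warnings when it detects a problem. Depending on the type of problem, you may want to run on-disk self-tests, back up the disk, replace the disk, or use a manufacturer's utility to force reallocation of bad or unreadable sectors. For more, see the guide on installing and configuring smartd.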
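A minimal smartd configuration might look like the fragment below. This is a sketch under stated assumptions: the configuration file is taken to be /etc/smartd.conf (the location differs between distributions), /dev/sdb is the disk used in the examples of this article, and admin@example.com is a placeholder address.

```
# /etc/smartd.conf (assumed location)
# Monitor /dev/sdb with the default set of checks (-a) and
# mail warnings to the given address (-m).
/dev/sdb -a -m admin@example.com
```

After editing the file, the smartd service has to be restarted for the change to take effect, and the server needs a working local mail setup for the warnings to be delivered.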

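GNOME Disk Utility

Many Linux distributions and other Unix-like systems such as FreeBSD and OpenBSD provide a graphical tool called Disks. It is only available on desktop and laptop systems running GNOME. To open the disk utility, go to:

Applications > System Tools > Disk Utility

Click a hard disk:

Click SMART Data to see the details:

An example of a healthy disk: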

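Say hello to GSmartControl

GSmartControl is a hard disk health inspection tool and a graphical front end for the smartctl command. It has the following features:

1. Automatically reports and highlights any anomalies.

2. Can enable and disable SMART.

3. Can enable and disable Automatic Offline Data Collection, a short self-check that the drive performs automatically every four hours with no impact on performance.

4. Supports configuration of global and per-drive options for smartctl.

5. Runs SMART self-tests.

6. Displays drive identity information, capabilities, attributes, and self-test logs.

7. Can read smartctl output from a saved file and interpret it as a virtual device.

8. Works on most operating systems supported by smartctl, such as the BSDs and various Linux distributions.

9. Comes with extensive help information.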

On Debian or Ubuntu, you can install it with the apt-get command:

$ sudo apt-get install gsmartcontrol

On Fedora, CentOS, or RHEL, the yum command does the same:

# yum install gsmartcontrol

Sample output:

Click a drive to see more information:

References:

* smartctl man page

* Monitoring hard disk health with smartd under Linux or Unix
