Test If Linux Server SCSI / SATA Hard Disk Going Bad
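One of our regular readers sent in a question:
How do I test whether my hard disk is going bad? I can see only a few errors in my /var/log/messages file.
I/O errors in /var/log/messages indicate that something is wrong with the hard disk and that it may be failing. You can check the disk for faults with the smartctl command, a control and monitoring utility for SMART disks on Linux and Unix-like operating systems.
smartctl is built on Self-Monitoring, Analysis and Reporting Technology (SMART), which is built into many ATA-3 and later ATA, IDE, and SCSI-3 hard drives. The purpose of SMART is to monitor the reliability of the hard drive, predict drive failures, and carry out different types of drive self-tests.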
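smartctl on the server
smartctl is a command-line utility designed to perform SMART tasks such as printing the SMART self-test and error logs, enabling and disabling SMART automatic testing, and starting device self-tests. First, make sure SMART support is enabled in the BIOS. Then run the following command to find out whether your disk supports SMART: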
# smartctl -i /dev/sdb
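In the `-i` report, the two lines that answer the question both begin with `SMART support is:`. As a minimal sketch, the helper below (the name `smart_support` is made up for this example) pulls out just the first word after that label, so a script can tell whether SMART is available and whether it is switched on. It assumes the ATA-style wording shown in the sample reports in this article.

```shell
#!/bin/sh
# Read `smartctl -i` output on stdin and print the first word that follows
# each "SMART support is:" label, one per line (e.g. Available, Enabled).
# smart_support is a hypothetical helper name, not part of smartmontools.
smart_support() {
    sed -n 's/^SMART support is: *\([A-Za-z]*\).*/\1/p'
}

# Example (needs root and a real disk):
#   smartctl -i /dev/sdb | smart_support
```

If the helper prints nothing, the report did not contain those lines, which is worth checking by hand rather than treating as a definite "no".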
To enable SMART, run:
# smartctl -s on -d ata /dev/sdb
Sample output:
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
To run the overall health self-assessment test, enter:
# smartctl -d ata -H /dev/sdb
Sample output:
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
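For use in a script, the verdict line can be reduced to a single word and an exit status. This is only a sketch: `smart_health` is a hypothetical helper name, and it matches the ATA wording shown above, so output in a different format is reported as UNKNOWN.

```shell
#!/bin/sh
# Read `smartctl -H` output on stdin, print the overall verdict,
# and return success only when that verdict is exactly PASSED.
# smart_health is a hypothetical helper name.
smart_health() {
    verdict=$(sed -n 's/^SMART overall-health self-assessment test result: *//p')
    echo "${verdict:-UNKNOWN}"
    [ "$verdict" = "PASSED" ]
}

# Example (needs root and a real disk):
#   smartctl -d ata -H /dev/sdb | smart_health || echo "check /dev/sdb"
```

Treating anything other than PASSED as a failure is a deliberately cautious choice: an unreadable report then triggers a manual check instead of passing silently.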
Sample output from a failing disk. Note that the overall result can still read PASSED while an individual attribute is flagged FAILING_NOW:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   044   033   045    Old_age   Always   FAILING_NOW 56 (96 110 58 25)
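In the report above, attribute 190 is flagged because its normalized VALUE (044) has dropped to or below its THRESH (045). The sketch below lists every attribute whose WHEN_FAILED column is anything other than "-". The function name `flagged_attributes` is an assumption made for this example, and the column positions follow the table header shown above.

```shell
#!/bin/sh
# Print every SMART attribute whose WHEN_FAILED column is not "-".
# Input: the attribute table from `smartctl -A` (or -a) on stdin.
# Columns: 1=ID 2=NAME 3=FLAG 4=VALUE 5=WORST 6=THRESH 7=TYPE 8=UPDATED 9=WHEN_FAILED
# flagged_attributes is a hypothetical helper name.
flagged_attributes() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && $9 != "-" {
        print $1, $2, $9, "value=" $4, "thresh=" $6
    }'
}

# Example (needs root and a real disk):
#   smartctl -A /dev/sda | flagged_attributes
```

The check on the FLAG column (`0x...`) keeps the filter from matching other numbered lines in a full `-a` report, such as the selective self-test spans.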
The following command gives more detailed information about the failing disk:
# smartctl --attributes --log=selftest /dev/sda
Sample output:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   098   092   006    Pre-fail  Always       -       238320363
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       587
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       9
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       51672328
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4805
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       586
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       417
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   094   094   000    Old_age   Always       -       6
190 Airflow_Temperature_Cel 0x0022   044   033   045    Old_age   Always   FAILING_NOW 56 (96 122 58 25)
194 Temperature_Celsius     0x0022   056   067   000    Old_age   Always       -       56 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   043   026   000    Old_age   Always       -       238320363
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       49
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       49
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       172082159686339
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2155546016
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3048586928

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4789         1746972641
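The raw counts for attributes 5, 197, and 198 (reallocated, pending, and offline-uncorrectable sectors) are non-zero on this disk: 9, 49, and 49. A small filter can pull just those three numbers out of the table. This is a sketch only: `sector_counts` is a made-up name, and it assumes the raw value is a plain number in the tenth column, as in the report above.

```shell
#!/bin/sh
# Print NAME=RAW_VALUE for the three sector-related attributes
# (IDs 5, 197 and 198) found in `smartctl -A` output on stdin.
# The FLAG test ($3 starts with 0x) restricts matches to attribute rows.
# sector_counts is a hypothetical helper name.
sector_counts() {
    awk '$3 ~ /^0x/ && ($1 == 5 || $1 == 197 || $1 == 198) { print $2 "=" $10 }'
}

# Example (needs root and a real disk):
#   smartctl -A /dev/sda | sector_counts
```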
You can get even more data by entering the following command:
# smartctl -d ata -a /dev/sdb
Output:
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2500YS-01SHB0
Serial Number:    WD-WCANY1729333
Firmware Version: 20.06C03
User Capacity:    251,000,193,024 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Jul  4 15:04:38 2007 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (7800) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  92) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   190   187   021    Pre-fail  Always       -       5500
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6382
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   127   096   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
A note about RAID controllers
The syntax for checking ATA disks behind a 3ware SCSI RAID controller is:
# smartctl -a -d 3ware,2 /dev/sda
# smartctl -a -d 3ware,0 /dev/twe0
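The number after `3ware,` selects the disk (port) behind the controller, so checking several disks means repeating the command once per port. The sketch below only prints the commands so they can be reviewed first. The function name `threeware_commands` and the idea of passing the highest port number by hand are assumptions for this example; the real port count depends on your controller.

```shell
#!/bin/sh
# Print one `smartctl -H` command per 3ware port, from port 0 up to
# the port number given as the second argument.
#   $1 = controller device (e.g. /dev/twe0), $2 = highest port number
# threeware_commands is a hypothetical helper name.
threeware_commands() {
    dev=$1
    last=$2
    port=0
    while [ "$port" -le "$last" ]; do
        echo "smartctl -H -d 3ware,$port $dev"
        port=$((port + 1))
    done
}

# Example: review the list, then run it as root by piping it to sh:
#   threeware_commands /dev/twe0 3
#   threeware_commands /dev/twe0 3 | sh
```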
For more information, see how to use the smartctl command to check disks behind Adaptec RAID and 3ware SCSI RAID controllers.
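Task: Extended disk self-test
To start an extended self-test on /dev/sdb, run the command below. You can issue it on a running system. The results appear in the self-test log, which you can view with the '-l selftest' option once the test has completed.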
# smartctl -d ata -t long /dev/sdb
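After the test has finished, the self-test log can be filtered for entries that ended in a failure. This is a rough sketch: `failed_selftests` is a hypothetical name, and the pattern is based on the "Completed: read failure" wording that appears in this article's sample log, so other failure wordings may need a wider pattern.

```shell
#!/bin/sh
# Read `smartctl -l selftest` output on stdin and print the log entries
# whose status is "Completed: ... failure". Returns success (0) only
# when at least one such entry is found, following grep's exit status.
# failed_selftests is a hypothetical helper name.
failed_selftests() {
    grep -E '^# *[0-9]+ .*Completed: .*failure'
}

# Example (needs root and a real disk):
#   if smartctl -l selftest /dev/sdb | failed_selftests; then
#       echo "self-test reported a failure on /dev/sdb"
#   fi
```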
Sample detailed report for a failing disk:
# smartctl -a /dev/sda
Sample output:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31500341AS
Serial Number:    9VS0TG4B
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Oct 26 21:16:15 2009 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 617) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   098   092   006    Pre-fail  Always       -       238338845
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       587
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       9
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       51672525
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4806
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       586
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       417
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   094   094   000    Old_age   Always       -       6
190 Airflow_Temperature_Cel 0x0022   044   033   045    Old_age   Always   FAILING_NOW 56 (96 126 58 25)
194 Temperature_Celsius     0x0022   056   067   000    Old_age   Always       -       56 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   043   026   000    Old_age   Always       -       238338845
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       49
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       49
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       107168023974595
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2155546480
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3048590512

SMART Error Log Version: 1
ATA Error Count: 416 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 416 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:55:03.917  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:55:03.818  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:55:03.798  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:55:03.779  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:55:03.658  READ NATIVE MAX ADDRESS EXT

Error 415 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:55:00.927  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:55:00.837  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:55:00.817  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:55:00.800  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:55:00.747  READ NATIVE MAX ADDRESS EXT

Error 414 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:54:57.903  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:54:57.807  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:54:57.787  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:54:57.757  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:54:57.637  READ NATIVE MAX ADDRESS EXT

Error 413 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:54:54.862  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:54:54.767  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:54:54.746  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:54:54.728  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:54:54.677  READ NATIVE MAX ADDRESS EXT

Error 412 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:54:51.838  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:54:51.736  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:54:51.716  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:54:51.685  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:54:51.566  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4789         1746972641

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
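Two figures in the error log are simple conversions that can be checked by hand: the power-on lifetime (4786 hours, shown as 199 days + 10 hours) and the LBA (0x0fffffff, shown as 268435455). The sketch below redoes both with shell arithmetic. The function names `hours_to_days` and `hex_to_dec` are made up for this example.

```shell
#!/bin/sh
# Convert a power-on lifetime in hours into days plus leftover hours.
# hours_to_days is a hypothetical helper name.
hours_to_days() {
    hours=$1
    echo "$((hours / 24)) days + $((hours % 24)) hours"
}

# Convert a hexadecimal LBA such as 0x0fffffff to decimal.
# hex_to_dec is a hypothetical helper name.
hex_to_dec() {
    echo "$(($1))"
}

hours_to_days 4786      # prints: 199 days + 10 hours
hex_to_dec 0x0fffffff   # prints: 268435455
```

The same day conversion applies to the LifeTime(hours) column of the self-test log, which helps place a failed test relative to the logged errors.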
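Restore from backup
If any of these tests reports errors, replace the disk and restore the data from backup.
Install smartd on the server to receive email alerts when problems are found
smartd is a daemon that monitors hard disks and attempts to enable SMART monitoring on them. It polls the disks' health data, including SCSI devices, every 30 minutes (configurable), and logs SMART errors and attribute changes through the SYSLOG interface. Where these SYSLOG notices and warnings end up depends on the system (typically /var/log/messages or /var/log/syslog). Besides logging to a file, smartd can also be configured to send an email warning when it detects a problem. Depending on the type of error, you may need to run self-tests on the disk, back up the disk, replace the disk, or use the manufacturer's utilities to force reallocation of bad or unreadable sectors. For more, see how to install and configure smartd.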
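As a hedged illustration of the email alerts described above, a smartd configuration entry names a device followed by directives: `-a` turns on smartd's default set of checks and `-m` sets the address that receives warnings. The helper below just prints such an entry. `smartd_entry` and the address `admin@example.com` are placeholders invented for this sketch.

```shell
#!/bin/sh
# Print one smartd configuration entry.
#   $1 = device to monitor, $2 = address that receives warning mail
# -a  turns on smartd's default set of checks
# -m  mails a warning to the given address when a check fails
# smartd_entry is a hypothetical helper name.
smartd_entry() {
    printf '%s -a -m %s\n' "$1" "$2"
}

# Example: show the entry for /dev/sdb (the address is a placeholder)
smartd_entry /dev/sdb admin@example.com
```

The printed line belongs in smartd's configuration file, commonly /etc/smartd.conf although the path varies by distribution, and smartd needs a restart before it picks up the change.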
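GNOME Disk Utility
Most Unix-like systems, such as FreeBSD and OpenBSD, ship with a graphical tool called Disks. It works only on desktop and laptop systems running GNOME. To open the disk utility: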
Applications > System Tools > Disk Utility
Click on the disk:
Click SMART data to view the details:
Example of a healthy disk:
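Say hello to GSmartControl
GSmartControl is a hard disk health inspection tool and a graphical front end for the smartctl command. It has the following features:
1. Automatically reports and highlights any anomalies
2. Can enable and disable SMART
3. Can enable and disable automatic offline data collection (a short self-check that the drive runs automatically every four hours with no impact on performance)
4. Supports configuring global and per-drive options for smartctl
5. Performs SMART self-tests
6. Displays drive information: capabilities, attributes, and self-test logs
7. Can read smartctl output from a saved file and interpret it as a virtual device
8. Works on most operating systems supported by smartctl, such as the *BSDs and the various Linux distributions
9. Comes with extensive help information
On Debian or Ubuntu, you can install it with the apt-get command as follows: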
$ sudo apt-get install gsmartcontrol
On Fedora, CentOS, or RHEL, the yum command does the same:
# yum install gsmartcontrol
Sample output:
Click on a disk to view more information:
References:
* smartctl man page
* Monitoring hard disk health with smartd under Linux or Unix