This is the third article in a series about the SCSI DISK FMA project.
Overview
When a disk is faulted, the FMA I/O Retire Agent is triggered. The resulting behavior is subtle and can cause confusion if you ignore FMA messages and are not in the habit of running fmadm faulty. Retired faulted disks seem to "disappear" from utilities such as format(1M), and for an in-use disk this behavior might not occur until after a reboot.
Also, by default the telemetry associated with disk problems is now recorded in the FMA error log (which you can view with, for example, fmdump -e) instead of /var/adm/messages. For more information, see SCSI DISK FMA Project Part 1: SCSI Device Drivers as FMA Telemetry Detectors.
There are two types of disk faults that can be diagnosed: DISK-8000-3E (hardware error) and DISK-8000-4Q (medium error). If you see a warning message on the console with either of these SUNW-MSG-IDs, you can find help by accessing a URL in the following format:
http://www.sun.com/msg/SUNW-MSG-ID
where SUNW-MSG-ID is the message ID, in this case either DISK-8000-3E or DISK-8000-4Q.
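As a small illustration of the URL format, the following Python sketch builds the help URL from a message ID. Only the base URL comes from the text above; the function name msg_help_url is hypothetical.

```python
# Base URL taken from the article; the function name is illustrative only.
BASE_URL = "http://www.sun.com/msg/"


def msg_help_url(msg_id: str) -> str:
    """Return the help URL for a SUNW-MSG-ID such as DISK-8000-3E."""
    return BASE_URL + msg_id.strip()


print(msg_help_url("DISK-8000-3E"))
```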
DISK-8000-3E
This message indicates that the Solaris Fault Manager has received reports from the kernel SCSI disk driver (sd) that a disk hardware failure occurred.
The recommended service action for this event is to schedule replacement of the affected disk drive at the earliest possible convenience. Although the disk drive might be functioning, it is neither intended nor recommended that the faulted disk drive remain in the system for a prolonged period of time.
Follow these steps to complete the recommended repair action.
Step 1: Find the 36-character UUID (EVENT-ID) string that is associated with the fault.
You can locate this string in several ways. Use either the fmdump(1M) or fmadm(1M) command shown in Example 1, or extract the UUID from the fault message displayed in the console output at the time of the fault.
Example 1: Finding the UUID (36-Character String)
[console]
Oct 13 17:04:25 icecube fmd: [ID 441519 daemon.error] SUNW-MSG-ID: DISK-8000-3E, TYPE: Fault, VER: 1, SEVERITY: Critical
...
Oct 13 17:04:25 icecube EVENT-ID: 75b3ef98-4210-e659-d339-a54044d858e7
[terminal]
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 13 17:04:25 75b3ef98-4210-e659-d339-a54044d858e7 DISK-8000-3E   Critical
# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Oct 13 17:04:25.7587 75b3ef98-4210-e659-d339-a54044d858e7 DISK-8000-3E
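If you capture the console or fmdump output to a file, the 36-character UUID can also be picked out mechanically. The following Python sketch is one way to do that, assuming the UUID always has the 8-4-4-4-12 hexadecimal layout seen in Example 1. The function name find_event_ids is hypothetical, and the sample text is copied from the example output above.

```python
import re

# 36-character UUID: 8-4-4-4-12 hexadecimal digits separated by hyphens.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b", re.IGNORECASE
)


def find_event_ids(text: str) -> list:
    """Return the distinct UUIDs found in text, in order of first appearance."""
    seen = []
    for match in UUID_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Two lines copied from Example 1 (console message and fmdump output).
sample = """\
Oct 13 17:04:25 icecube EVENT-ID: 75b3ef98-4210-e659-d339-a54044d858e7
Oct 13 17:04:25.7587 75b3ef98-4210-e659-d339-a54044d858e7 DISK-8000-3E
"""

print(find_event_ids(sample))
```

Both sample lines carry the same UUID, so the function reports it once.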
Step 2: Use the command fmadm faulty or fmdump -v -u UUID to locate the faulted disk. See Example 2.
Example 2: Determining Which FRU (Disk Drive) Needs to Be Replaced
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 13 17:04:25 75b3ef98-4210-e659-d339-a54044d858e7 DISK-8000-3E   Critical
Fault class : fault.io.scsi.cmd.disk.dev.rqs.derr
Affects     : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJM7EED
              //pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@2,0
                  faulted and taken out of service
FRU         : "HD_ID_34" (hc://:product-id=Sun-Fire-X4500:
              chassis-id=00-14-4F-20-E3-08:server-id=icecube:
              serial=KRVN63ZAJM7EED:part=HITACHI-HDS7250SASUN500G-0633KM7EED:
              revision=K2AOAJ0A/chassis=0/bay=34/disk=0)
                  faulty
# fmdump -v -u 75b3ef98-4210-e659-d339-a54044d858e7
TIME                 UUID                                 SUNW-MSG-ID
Oct 13 17:04:25.7587 75b3ef98-4210-e659-d339-a54044d858e7 DISK-8000-3E
  100%  fault.io.scsi.cmd.disk.dev.rqs.derr
        Problem in: hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:
                    server-id=icecube:serial=KRVN63ZAJM7EED:
                    part=HITACHI-HDS7250SASUN500G-0633KM7EED:
                    revision=K2AOAJ0A/chassis=0/bay=34/disk=0
           Affects: dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJM7EED
                    //pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@2,0
               FRU: hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:
                    server-id=icecube:serial=KRVN63ZAJM7EED:
                    part=HITACHI-HDS7250SASUN500G-0633KM7EED:
                    revision=K2AOAJ0A/chassis=0/bay=34/disk=0
          Location: HD_ID_34
Step 3: Identify the FRU that needs to be replaced.
On supported platforms, the FRU label contains an identifier for the physical location. In Example 2, the service action would be to replace the disk located in bay=34, because the FRU label is HD_ID_34. On platforms where this information isn't available, refer to the platform-specific documentation to identify which physical location corresponds to the failed device.
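The bay number can also be read out of the FRU string itself. The following Python sketch assumes the hc-scheme FMRI ends in /name=number path components, as in Example 2. The function name fru_location is hypothetical, and the FMRI is the example's FRU value with its wrapped lines joined.

```python
import re


def fru_location(fmri: str) -> dict:
    """Map each trailing /name=number component of an hc FMRI to an int."""
    # Path components look like /chassis=0/bay=34/disk=0.
    return {
        name: int(value)
        for name, value in re.findall(r"/([a-z-]+)=(\d+)", fmri)
    }


# FRU value from Example 2, with the wrapped lines joined into one string.
fmri = (
    "hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:"
    "server-id=icecube:serial=KRVN63ZAJM7EED:"
    "part=HITACHI-HDS7250SASUN500G-0633KM7EED:"
    "revision=K2AOAJ0A/chassis=0/bay=34/disk=0"
)

print(fru_location(fmri))
```

In this example the bay value matches the number in the HD_ID_34 label; other platforms might not follow the same convention.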
Step 4: Replace the faulted FRU (repairing the faulted resource).
Refer to your platform's hardware maintenance manual or service label for the proper disk replacement procedure. These procedures include software steps to prepare the disk for replacement; for example, unmounting file systems, disk management considerations, cfgadm commands, and so on.
Step 5: Manually run fmadm repair UUID to bring the disk drive back into service.
# fmadm repair 75b3ef98-4210-e659-d339-a54044d858e7
fmadm: recorded repair to 75b3ef98-4210-e659-d339-a54044d858e7
Step 6: Verify that the repaired resource is no longer faulted.
After the disk drive is replaced, use the Solaris command fmadm faulty to display all faulted resources in the system. Confirm that the repaired resource is no longer listed as faulted.
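If you save the output of fmadm faulty to a file or a variable, the check can be scripted. The following Python sketch only tests whether the UUID still appears in the captured text. The function name still_faulted is hypothetical, and the empty string is a stand-in for output captured after the repair, not a reproduction of what fmadm prints.

```python
def still_faulted(fmadm_faulty_output: str, event_id: str) -> bool:
    """Return True if event_id still appears in captured fmadm faulty output."""
    return event_id.lower() in fmadm_faulty_output.lower()


EVENT_ID = "75b3ef98-4210-e659-d339-a54044d858e7"

# Line copied from Example 2: the fault is listed before the repair.
before = "Oct 13 17:04:25 75b3ef98-4210-e659-d339-a54044d858e7 DISK-8000-3E Critical"
# Placeholder for output captured after the repair (assumption, not real output).
after = ""

print(still_faulted(before, EVENT_ID))  # True
print(still_faulted(after, EVENT_ID))   # False
```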
DISK-8000-4Q
The procedure for repairing the faulted device is almost the same as that used for DISK-8000-3E. Because a medium error can be caused by an error in the recorded data, one more step is needed to locate the faulted logical block address (LBA) on the disk. In this case, add the following information to the previous Step 4.
Step 4, continued: You might be able to determine the logical block address that results in this fault by checking the output of fmdump -eV -u UUID. See the following example.
Example 3: Finding the LBA of the Disk
# fmdump -eV -u bd09c30a-a84a-e518-bc2c-8f2108ac342d
...
lba = 0x12345678
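The lba value is printed in hexadecimal. The following Python sketch pulls it out of saved fmdump -eV output and converts it to a decimal block number, assuming the field appears on its own line in the form shown in Example 3. The function name find_lbas is hypothetical.

```python
import re

# Matches a line such as "lba = 0x12345678" and captures the hex value.
LBA_RE = re.compile(r"^\s*lba\s*=\s*(0x[0-9a-fA-F]+)\s*$", re.MULTILINE)


def find_lbas(ereport_text: str) -> list:
    """Return every lba value found in the text as an integer."""
    return [int(value, 16) for value in LBA_RE.findall(ereport_text)]


# Sample mirrors Example 3; the elided lines stay elided.
sample = """\
...
lba = 0x12345678
"""

print(find_lbas(sample))  # [305419896]
```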
About the Authors
David Zhang is a Sun Software Engineer. His SCSI FMA team is working on disk/tape fault management projects based on the SCSI protocol. He has an M.S. in Computer Science from Harbin Institute of Technology.
Chris Horne is a Sun Senior Staff Engineer. His research interests include Solaris IO, the Storage software stack, and any innovations in operating systems.