SCSI DISK FMA Project Part 3: FMA Behavior of Retired Faulted SCSI Disks

 
By David Zhang and Chris Horne, December 2008  
 

This is the third article in a series about the SCSI DISK FMA project.

Overview

When a disk is faulted, the FMA I/O Retire Agent is triggered. The resulting behavior is subtle and can cause confusion if you overlook FMA messages and do not routinely run fmadm faulty. A retired faulted disk seems to "disappear" from utilities such as format(1M); for a disk that is in use, this might not happen until after a reboot.

Also, by default the telemetry associated with disk problems is now recorded in the FMA error log (which you can view with, for example, fmdump -e) instead of /var/adm/messages. For more information, see SCSI DISK FMA Project Part 1: SCSI Device Drivers as FMA Telemetry Detectors.

There are two types of disk faults that can be diagnosed: DISK-8000-3E (hardware error) and DISK-8000-4Q (medium error). If you see a warning message on the console with either of these SUNW-MSG-IDs, you can find help by accessing a URL in the following format:

     http://www.sun.com/msg/SUNW-MSG-ID
 

where SUNW-MSG-ID is the message ID, in this case either DISK-8000-3E or DISK-8000-4Q.
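As a small illustration, the help URL can be assembled from a message ID in the shell. This is only a sketch: the function name msg_url is hypothetical, and the only thing taken from the article is the URL format itself.

```shell
# msg_url is a hypothetical helper name; the URL format is the one given above.
msg_url() {
    printf 'http://www.sun.com/msg/%s\n' "$1"
}

msg_url DISK-8000-3E
msg_url DISK-8000-4Q
```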

DISK-8000-3E

This message indicates that the Solaris Fault Manager has received reports from the kernel SCSI disk driver (sd) that a disk hardware failure occurred.

The recommended service action for this event is to schedule replacement of the affected disk drive at the earliest possible convenience. Although the disk drive might be functioning, it is neither intended nor recommended that the faulted disk drive remain in the system for a prolonged period of time.

Follow these steps to complete the recommended repair action.

Step 1: Find the 36-character UUID (EVENT-ID) string that is associated with the fault.

This string can be located in several ways. Use either the fmdump(1M) or fmadm(1M) command shown in Example 1, or extract the UUID from the fault message displayed in the console output at the time of the fault.

Example 1: Finding the UUID (36-Character String)

        [console]
        Oct 13 17:04:25 icecube fmd: [ID 441519 daemon.error] SUNW-MSG-ID: DISK-8000-3E, TYPE: Fault, VER: 1, SEVERITY: Critical
        ...
        Oct 13 17:04:25 icecube EVENT-ID: 75b3ef98-4210-e659-d339-a54044d858e7

        [terminal]
        # fmadm faulty
        --------------- ------------------------------------  -------------- ---------
        TIME            EVENT-ID                              MSG-ID         SEVERITY
        --------------- ------------------------------------  -------------- ---------
        Oct 13 17:04:25 75b3ef98-4210-e659-d339-a54044d858e7  DISK-8000-3E   Critical 

        # fmdump
        TIME                 UUID                                 SUNW-MSG-ID
        Oct 13 17:04:25.7587 75b3ef98-4210-e659-d339-a54044d858e7 DISK-8000-3E
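If the UUID is needed in a script, it can be picked out of the command output rather than copied by hand. The following is a minimal sketch, not part of the FMA tooling: extract_uuids is a hypothetical helper that assumes the UUID appears as a whitespace-separated, lowercase, 36-character field, as it does in Example 1.

```shell
# extract_uuids is a hypothetical helper: it reads fmdump or fmadm faulty
# style output on stdin and prints every field that looks like a UUID
# (36 characters, only hex digits and hyphens, at least one hex digit).
# The last condition skips the 36-hyphen ruler line in "fmadm faulty" output.
extract_uuids() {
    awk '{
        for (i = 1; i <= NF; i++)
            if (length($i) == 36 && $i ~ /^[0-9a-f-]+$/ && $i ~ /[0-9a-f]/)
                print $i
    }'
}

# Sample input copied from Example 1; on a live system the input would
# come from a pipe such as: fmdump | extract_uuids
extract_uuids <<'EOF'
TIME                 UUID                                 SUNW-MSG-ID
Oct 13 17:04:25.7587 75b3ef98-4210-e659-d339-a54044d858e7 DISK-8000-3E
EOF
```

Matching on the shape of the field, instead of a fixed column number, lets the same helper work on both the fmdump and the fmadm faulty layouts shown above.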
 
 
Step 2: Use the command fmadm faulty or fmdump -v -u UUID to locate the faulted disk. See Example 2.
 

Example 2: Determining Which FRU (Disk Drive) Needs to Be Replaced

        # fmadm faulty
        --------------- ------------------------------------  -------------- ---------
        TIME            EVENT-ID                              MSG-ID         SEVERITY
        --------------- ------------------------------------  -------------- ---------
        Oct 13 17:04:25 75b3ef98-4210-e659-d339-a54044d858e7  DISK-8000-3E   Critical 

        Fault class : fault.io.scsi.cmd.disk.dev.rqs.derr
        Affects     : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJM7EED
                       //pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@2,0
                           faulted and taken out of service
        FRU         : "HD_ID_34" (hc://:product-id=Sun-Fire-X4500:
                 chassis-id=00-14-4F-20-E3-08:server-id=icecube:
              serial=KRVN63ZAJM7EED:part=HITACHI-HDS7250SASUN500G-0633KM7EED:
              revision=K2AOAJ0A/chassis=0/bay=34/disk=0)
                           faulty

        # fmdump -v -u 75b3ef98-4210-e659-d339-a54044d858e7
        TIME                 UUID                                 SUNW-MSG-ID
        Oct 13 17:04:25.7587 75b3ef98-4210-e659-d339-a54044d858e7 DISK-8000-3E
           100%  fault.io.scsi.cmd.disk.dev.rqs.derr

        Problem in: hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:
                server-id=icecube:serial=KRVN63ZAJM7EED:
            part=HITACHI-HDS7250SASUN500G-0633KM7EED:
            revision=K2AOAJ0A/chassis=0/bay=34/disk=0
           Affects: dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJM7EED
                //pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@2,0
              FRU: hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:
                server-id=icecube:serial=KRVN63ZAJM7EED:
            part=HITACHI-HDS7250SASUN500G-0633KM7EED:
            revision=K2AOAJ0A/chassis=0/bay=34/disk=0
        Location: HD_ID_34
 
 
Step 3: Identify the FRU that needs to be replaced.

On supported platforms, the FRU label contains a location identifier. In Example 2, the service action would be to replace the disk located in bay=34, because the FRU label reported for the fault is HD_ID_34. On platforms where this information isn't available, refer to the platform-specific documentation to identify which physical location corresponds to the failed device.
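On platforms that report a label, the location can also be read from the fmdump -v output programmatically. The sketch below uses a hypothetical helper, fru_location, and assumes the output contains a "Location:" line with a single-word label, as in Example 2.

```shell
# fru_location is a hypothetical helper: it reads "fmdump -v -u UUID" output
# on stdin and prints the value of the Location line, if one is present.
# It assumes the label is a single word with no embedded spaces.
fru_location() {
    awk '$1 == "Location:" { print $2 }'
}

# Sample lines copied from Example 2; on a live system the input would
# come from a pipe such as: fmdump -v -u UUID | fru_location
fru_location <<'EOF'
            revision=K2AOAJ0A/chassis=0/bay=34/disk=0
        Location: HD_ID_34
EOF
```

For the sample input this prints HD_ID_34. If no Location line is present, the helper prints nothing, which is the cue to fall back to the platform documentation.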

Step 4: Replace the faulted FRU (repairing the faulted resource).

Refer to your platform's hardware maintenance manual or service label for the proper disk replacement procedure. These procedures include software steps to prepare the disk for replacement, for example unmounting file systems, disk management considerations, cfgadm commands, and so on.

Step 5: Manually run fmadm repair UUID to return the disk drive to service.
 
    # fmadm repair 75b3ef98-4210-e659-d339-a54044d858e7
    fmadm: recorded repair to 75b3ef98-4210-e659-d339-a54044d858e7

 
 
Step 6: Verify that the repaired resource is no longer faulted.

After the disk drive is replaced, use the Solaris command fmadm faulty to display all faulted resources in the system. Confirm that the repaired resource is no longer listed as faulted, using the following command.

        # fmadm faulty
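The check can also be scripted by searching the fmadm faulty output for the UUID. This is a sketch under stated assumptions: fault_listed is a hypothetical helper, and it assumes that a fault that is still open shows its UUID in the output, as in Example 2.

```shell
# fault_listed is a hypothetical helper: it reads "fmadm faulty" output on
# stdin and succeeds (exit status 0) if the given UUID still appears in it.
fault_listed() {
    grep -F -- "$1" >/dev/null
}

uuid=75b3ef98-4210-e659-d339-a54044d858e7

# On a live system: fmadm faulty | fault_listed "$uuid"
# Here the check runs against empty output, as expected after a repair.
if printf '' | fault_listed "$uuid"; then
    echo "still faulted: $uuid"
else
    echo "no longer listed: $uuid"
fi
```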
 
 
DISK-8000-4Q

The procedure for repairing the faulted device is almost the same as that used for DISK-8000-3E. Because a medium error can be caused by an error in the recorded data, one more step is needed to locate the faulted logical block address (LBA) on the disk. In this case, add the following information to the previous Step 4.

Step 4, continued: You might be able to determine the logical block address that results in this fault by checking the output of fmdump -eV -u UUID. See the following example.
 

Example 3: Finding the LBA of the Disk

        # fmdump -eV -u bd09c30a-a84a-e518-bc2c-8f2108ac342d
        ...
        lba = 0x12345678
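The lba value is reported in hexadecimal. When a decimal block number is more convenient, the value can be extracted and converted in the shell. The sketch below uses a hypothetical helper, lba_decimal, and assumes the "lba = 0x..." line format shown in Example 3.

```shell
# lba_decimal is a hypothetical helper: it reads "fmdump -eV" output on stdin,
# finds lines of the form "lba = 0x...", and prints each value in decimal.
lba_decimal() {
    awk '$1 == "lba" && $2 == "=" { print $3 }' |
    while read -r hex; do
        printf '%d\n' "$hex"
    done
}

# Sample line copied from Example 3; on a live system the input would
# come from a pipe such as: fmdump -eV -u UUID | lba_decimal
lba_decimal <<'EOF'
    lba = 0x12345678
EOF
```

For the sample value 0x12345678 this prints 305419896.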
 
 
About the Authors

David Zhang is a Sun Software Engineer. His SCSI FMA team is working on disk/tape fault management projects based on the SCSI protocol. He has an M.S. in Computer Science from Harbin Institute of Technology.

Chris Horne is a Sun Senior Staff Engineer. His research interests include Solaris IO, the Storage software stack, and any innovations in operating systems.
