Identifying a Failed SSD Drive
With erasure coding (EC) enabled, the Lightbits software continues to serve I/O without interruption when an SSD fails. When troubleshooting a drive failure, a drive can be in one of three states.
Drive Status | Description |
---|---|
Healthy | The SSD is functioning properly. |
Rebuilding | The SSD has failed and data reconstruction is in progress. |
Failed | Data reconstruction has completed. You can remove the failed SSD and insert a new SSD. |
- Check the status of the NVMe devices by entering the `lbcli list nvme-devices` command to see whether any SSD has failed and is in an EC rebuild process.
Sample Command
$ lbcli list nvme-devices | grep -E "Failed|Rebuilding"
A -J flag after lbcli indicates that the JWT is not stored in the lbcli configuration file.
Sample Output
Name Size Serial State Server UUID Node UUID
nvme1n1 4T 7000SZ450RGN Rebuilding fc9849f7-7380-48d8-904b-48a89c4da7a0 fb9c7b76-22f2-4fcf-af1a-70933c1dd3fb
In this example, the output shows one NVMe SSD that has failed and is currently undergoing data reconstruction.
Since this example does not use the --node-uuid or --server-uuid flags, the output shows all of the failed NVMe SSDs across the entire cluster. You can filter for specific nodes or servers using these flags. Once the data reconstruction is complete and the SSD state changes to Failed, the SSD is no longer managed by any node and is not associated with a node UUID.
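As a minimal sketch, the same check can be done by matching on the State column only, rather than on the whole line. The helper name `filter_by_state` is hypothetical, and the sketch assumes the whitespace-separated column layout shown in the sample output (State is the fourth field); adjust it if your lbcli version prints a different layout.

```shell
# filter_by_state STATES
# Reads `lbcli list nvme-devices` output on stdin and prints the name,
# serial, and state of every device whose State column (assumed to be
# field 4) equals one of the comma-separated STATES.
filter_by_state() {
    awk -v states="$1" '
        BEGIN { n = split(states, want, ",") }
        {
            for (i = 1; i <= n; i++)
                if ($4 == want[i]) print $1, $3, $4
        }
    '
}

# Usage (requires lbcli on the PATH):
#   lbcli list nvme-devices | filter_by_state Failed,Rebuilding
```

Matching on a single field avoids false hits when the text "Failed" or "Rebuilding" happens to appear elsewhere on a line.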
- To monitor the rebuild progress of a failed SSD, use the `lbcli get node` command, passing the UUID of the Lightbits node that is managing the failed NVMe SSD.
Sample Command
$ lbcli get node --uuid=fb9c7b76-22f2-4fcf-af1a-70933c1dd3fb
A -J flag after lbcli indicates that the JWT is not stored in the lbcli configuration file.
Sample Output
UUID: fb9c7b76-22f2-4fcf-af1a-70933c1dd3fb
clusterManagerMode: PassiveMode
ec: true
failureDomains:
- rack08
hostname: rack08-server62
maxNvmeDevices: 12
name: server02-0
nvmeEndpoint: 10.17.51.7:4420
state: Active
inLocalRebuild: true
localRebuildProgress: 53
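The progress check can be repeated automatically. The following is a rough sketch, not an official tool: the helper names `rebuild_progress` and `watch_rebuild` and the 30-second default interval are hypothetical, and the parsing assumes the `key: value` output format shown in the sample above.

```shell
# rebuild_progress
# Reads `lbcli get node` output on stdin and prints the value of the
# localRebuildProgress field (assumed "key: value" format).
rebuild_progress() {
    awk -F': *' '$1 ~ /^[[:space:]]*localRebuildProgress$/ { print $2; exit }'
}

# watch_rebuild NODE_UUID [INTERVAL_SECONDS]
# Polls the node until inLocalRebuild is no longer reported as true.
# Uses the same command form as the sample command above.
watch_rebuild() {
    node_uuid=$1
    interval=${2:-30}
    while out=$(lbcli get node --uuid="$node_uuid") &&
          printf '%s\n' "$out" | grep -q 'inLocalRebuild: true'; do
        echo "localRebuildProgress: $(printf '%s\n' "$out" | rebuild_progress)"
        sleep "$interval"
    done
}

# Usage (requires lbcli on the PATH):
#   watch_rebuild fb9c7b76-22f2-4fcf-af1a-70933c1dd3fb 60
```

The loop also stops if the `lbcli` call itself fails, so confirm the final device state with `lbcli list nvme-devices` afterward.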
- Recheck the status of the NVMe devices with the `lbcli list nvme-devices` command to see whether the status of the failed SSD has changed from Rebuilding to Failed. If the status has changed, the rebuild process is complete.
Sample Command
$ lbcli list nvme-devices --node-uuid=fb9c7b76-22f2-4fcf-af1a-70933c1dd3fb
A -J flag after lbcli indicates that the JWT is not stored in the lbcli configuration file.
Sample Output
Name Size Serial State Server UUID Node UUID
nvme0n1 4T 700084450RGN Healthy fc9849f7-7380-48d8-904b-48a89c4da7a0 fb9c7b76-22f2-4fcf-af1a-70933c1dd3fb
nvme1n1 4T 7000SZ450RGN Failed fc9849f7-7380-48d8-904b-48a89c4da7a0 ---
nvme2n1 4T 7000UM450RGN Healthy fc9849f7-7380-48d8-904b-48a89c4da7a0 fb9c7b76-22f2-4fcf-af1a-70933c1dd3fb
nvme3n1 4T 7000V9450RGN Healthy fc9849f7-7380-48d8-904b-48a89c4da7a0 fb9c7b76-22f2-4fcf-af1a-70933c1dd3fb
nvme4n1 4T 910005450RGN Healthy fc9849f7-7380-48d8-904b-48a89c4da7a0 fb9c7b76-22f2-4fcf-af1a-70933c1dd3fb
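To confirm in a script that a specific device has reached the Failed state before replacing it, a small lookup by device name may help. This is a sketch under the same assumption as the sample output (Name in field 1, State in field 4); the helper name `device_state` is hypothetical.

```shell
# device_state NAME
# Reads `lbcli list nvme-devices` output on stdin and prints the State
# column (assumed to be field 4) for the device called NAME.
device_state() {
    awk -v name="$1" '$1 == name { print $4; exit }'
}

# Usage (requires lbcli on the PATH):
#   state=$(lbcli list nvme-devices --node-uuid=fb9c7b76-22f2-4fcf-af1a-70933c1dd3fb | device_state nvme1n1)
#   [ "$state" = "Failed" ] && echo "nvme1n1: rebuild complete, SSD can be replaced"
```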
To replace the failed device, follow the steps for Adding an NVMe SSD to a Lightbits Storage Server.