Faulty NVMe Device

Description: Faulty NVMe Device
Version: 2.x
Symptoms

[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e | sort
Name     Size     NUMA ID  Serial        State    Server UUID                           Node UUID
nvme0n1  3.5 TiB  1        191723069582  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme1n1  3.5 TiB  1        1917230681E6  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme2n1  3.5 TiB  1        191723069952  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme3n1  3.5 TiB  1        191723068215  Failed   12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme4n1  3.5 TiB  1        191723066F0E  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme5n1  3.5 TiB  1        19172306A3F1  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2

[root@rack12-server15 ~]# lbcli list nodes | sort
Name        UUID                                  State   NVMe endpoint     Failure domains  Local rebuild progress
server00-0  ff4a7223-c131-5e8b-b499-c060afdda0f2  Active  10.17.233.1:4420  [server00]       6
server01-0  12017599-cda2-544b-bed9-cdda9cdd80b6  Active  10.17.233.2:4420  [server01]       None
server02-0  99de8bed-4dbd-5aca-956f-ecac6dca7878  Active  10.17.233.3:4420  [server02]       None

Troubleshooting Steps

The nvme list command output will show that the device is missing.

[root@rack12-server15 ~]# nvme list
Node           SN            Model                      Namespace  Usage              Format       FW Rev
-------------  ------------  -------------------------  ---------  -----------------  -----------  --------
/dev/nvme0n1   191723069582  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB  512 B + 0 B  11300DN0
/dev/nvme1n1   1917230681E6  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB  512 B + 0 B  11300DN0
/dev/nvme2n1   191723069952  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB  512 B + 0 B  11300DN0
/dev/nvme4n1   191723066F0E  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB  512 B + 0 B  11300DN0
/dev/nvme5n1   19172306A3F1  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB  512 B + 0 B  11300DN0

In the example above, you can see that nvme3n1 is missing.
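
As a supplementary check (not part of the original procedure), you can confirm the failure from the server itself. The lbcli filter below reuses the server UUID from the example output, and the kernel-log check is a generic Linux step rather than a Lightbits-specific one; adjust both to your environment.

# Show only the devices that the cluster reports as Failed on this server
[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e | grep -i failed

# Look for recent NVMe controller resets or I/O errors in the kernel log
[root@rack12-server15 ~]# dmesg -T | grep -i nvme | tail -n 20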

Root Cause

Failed NVMe device.

Data SSD Failure Handling

Lightbits storage handles NVMe SSD failures with its Elastic RAID capability, which protects the data stored in the Lightbits storage instance with an N+1 erasure coding mechanism (where N+1 is the total number of drives in the instance).

If one SSD fails or is removed, this feature ensures that the storage service can continue.

In parallel, the Lightbits storage instance starts a “local rebuild” to bring the data back to full “N’+1” protection, where N’ is now N-1 because one drive is gone. In other words, after a drive fails or is removed, the instance re-protects its data to ((N-1) + 1).
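
You can track this re-protection through the Local rebuild progress column of lbcli list nodes, the same command used in the Symptoms section above. A minimal way to watch it, assuming the standard watch utility is installed on the server:

# Re-run the node listing every 5 seconds and watch the Local rebuild progress column
[root@rack12-server15 ~]# watch -n 5 "lbcli list nodes"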

This feature can be enabled or disabled during installation. Note also that after a replacement drive is properly added, the instance re-protects back to N+1; the local rebuild progress seen in the node listing above is that re-protection at work.

If another drive fails after the rebuild completes, the instance rebuilds again to (N-2) + 1. Usable capacity shrinks with each failure or removal and its re-protection, so make sure the instance is not running at or near its usable capacity. Note also that erasure coding works only with eight or more drives.
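
To make the capacity impact concrete, here is a rough sketch based on the six 3.5 TiB drives in the example server. It assumes roughly one drive's worth of parity overhead and ignores metadata, over-provisioning, and data reduction, so treat the numbers as illustrative only, not as the exact usable capacity Lightbits reports.

# Illustrative arithmetic only: approximate usable capacity before and after one drive failure,
# assuming one drive's worth of N+1 parity overhead.
DRIVES=6; DRIVE_TIB=3.5
echo "healthy, N+1 = $DRIVES drives:          ~$(echo "($DRIVES-1)*$DRIVE_TIB" | bc) TiB usable"
echo "after one failure, (N-1)+1 = 5 drives: ~$(echo "($DRIVES-2)*$DRIVE_TIB" | bc) TiB usable"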

For additional information, see SSD Failure Handling.

If another SSD in the same Lightbits storage instance fails during the local rebuild, the instance becomes inactive. Even then, however, the data is still protected at the cluster level by replicas on other nodes.

Capacity Scale-Up

The Lightbits storage cluster supports dynamically expanding its total physical capacity as requirements grow. This is important for reducing TCO, because hardware purchases can be delayed until the capacity is actually needed.

Capacity expansion supports both scale-up and scale-out. Scale-up refers to adding more NVMe SSDs to existing storage servers, while scale-out refers to adding more storage servers to grow both capacity and performance.
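
After physically adding an SSD to a server (scale-up), you can confirm that both the operating system and the cluster see the new device by re-running the commands already used in this article; the server UUID below is the one from the example output above.

# Confirm the operating system enumerates the new drive
[root@rack12-server15 ~]# nvme list

# Confirm the cluster has picked it up and reports it as Healthy
[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e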

For additional information, see Capacity Scale Up.

Journal SSD Failure Handling

If one of the NVMe devices used for journaling fails, the node is marked as unhealthy and moves to a “permanently failed” state. This automatically triggers the fail-in-place capability, and the cluster re-creates the 2x and 3x replicas on other healthy nodes. The NVMe device is also reported as unhealthy by the get/list nvme-devices API. If such a failure occurs, contact Lightbits Support for additional assistance.

In a dual-instance deployment, a failure of a Journal NVMe device in one instance will not affect the other instance, which will continue to operate properly.

When a journaling device fails and the cluster can identify the failed disk, an NVMeDeviceFailed (JournalDeviceFailed) event is generated. If the failure is detected at the RAID level instead, a NodeJournalingDeviceUnknownFailure event is generated. Either event is raised in addition to the NodeInactive event.
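
While waiting for Lightbits Support, you can at least confirm the node and device state with the commands already shown in this article; the exact command for listing cluster events depends on your lbcli version, so it is not shown here.

# List nodes and look for any that are not Active (for example, the "permanently failed" state described above)
[root@rack12-server15 ~]# lbcli list nodes

# Show any devices the cluster does not report as Healthy (server UUID from the example above)
[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e | grep -vi healthy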
