Faulty NVMe Device

Description: Faulty NVMe Device
Version: 2.x
Symptoms

The lbcli list nvme-devices output reports one device on the server (nvme3n1) in the Failed state:

[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e | sort
Name     Size     NUMA ID  Serial        State    Server UUID                           Node UUID
nvme0n1  3.5 TiB  1        191723069582  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme1n1  3.5 TiB  1        1917230681E6  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme2n1  3.5 TiB  1        191723069952  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme3n1  3.5 TiB  1        191723068215  Failed   12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme4n1  3.5 TiB  1        191723066F0E  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme5n1  3.5 TiB  1        19172306A3F1  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2

The corresponding node remains Active, with a local rebuild in progress:

[root@rack12-server15 ~]# lbcli list nodes | sort
Name        UUID                                  State   NVMe endpoint     Failure domains  Local rebuild progress
server00-0  ff4a7223-c131-5e8b-b499-c060afdda0f2  Active  10.17.233.1:4420  [server00]       6
server01-0  12017599-cda2-544b-bed9-cdda9cdd80b6  Active  10.17.233.2:4420  [server01]       None
server02-0  99de8bed-4dbd-5aca-956f-ecac6dca7878  Active  10.17.233.3:4420  [server02]       None
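
In clusters with many drives, you can narrow the listing to devices that are not Healthy. The filter below is a plain grep over the same lbcli output shown above:

[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e | grep -v Healthy
Name     Size     NUMA ID  Serial        State    Server UUID                           Node UUID
nvme3n1  3.5 TiB  1        191723068215  Failed   12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2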

Troubleshooting Steps

The nvme list command output will show that the failed device is missing:

[root@rack12-server15 ~]# nvme list
Node          SN            Model                      Namespace  Usage               Format       FW Rev
------------  ------------  -------------------------  ---------  ------------------  -----------  --------
/dev/nvme0n1  191723069582  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB   512 B + 0 B  11300DN0
/dev/nvme1n1  1917230681E6  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB   512 B + 0 B  11300DN0
/dev/nvme2n1  191723069952  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB   512 B + 0 B  11300DN0
/dev/nvme4n1  191723066F0E  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB   512 B + 0 B  11300DN0
/dev/nvme5n1  19172306A3F1  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB   512 B + 0 B  11300DN0

In the example above, you can see that nvme3n1 is missing.
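
The kernel log normally records the controller failure or removal as well. The exact messages vary by drive model and kernel version, but grepping dmesg for the missing controller is a quick OS-level confirmation:

[root@rack12-server15 ~]# dmesg | grep -i nvme3

If the controller is still visible to the OS but misbehaving, nvme smart-log reports its health counters (critical warnings, media errors, available spare), which can help confirm a failing drive before it disappears entirely:

[root@rack12-server15 ~]# nvme smart-log /dev/nvme3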

Root Cause

Failed NVMe device.

SSD Failure Handling

Lightbits storage handles NVMe SSD failure with its Elastic RAID capability, which protects the data stored in a Lightbits storage instance with an N+1 erasure coding mechanism (where N+1 is the total number of drives in the instance).

If one SSD fails or is removed, this feature ensures that the storage service can continue.

In parallel, the Lightbits storage instance starts a "local rebuild" to bring the data back to full N'+1 protection, where N' is now N-1 because one drive has been lost. In other words, after a drive fails or is removed, the instance re-protects the data to (N-1)+1.
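
You can follow the local rebuild from the cluster side with the same lbcli list nodes command shown in the Symptoms section; the Local rebuild progress column (6 for server00-0 above, None for nodes with no rebuild running) tracks how far the rebuild has progressed. For example, to refresh the listing every 30 seconds:

[root@rack12-server15 ~]# watch -n 30 'lbcli list nodes'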

This feature can be enabled or disabled during installation. Note also that after a drive is properly added back, the instance re-protects to N+1 again; the rebuild seen in the node listing is that re-protection at work.

If another drive fails after the rebuild completes, the instance rebuilds again to (N-2)+1. Usable capacity shrinks with each failure or removal and the re-protection that follows, so make sure the instance is not already near its usage capacity. Additionally, erasure coding requires eight or more drives.
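
As a rough illustration only (assuming the parity overhead is about one drive's worth of capacity, and ignoring metadata and spare space): with the six 3.5 TiB drives from the server above, usable capacity is roughly 5 x 3.5 TiB before a failure, and roughly 4 x 3.5 TiB once a failed drive has been rebuilt around:

[root@rack12-server15 ~]# echo "5 * 3.5" | bc
17.5
[root@rack12-server15 ~]# echo "4 * 3.5" | bc
14.0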

For additional information, see SSD Failure Handling.

If another SSD in the same Lightbits storage instance fails while the local rebuild is still in progress, that Lightbits storage instance becomes inactive. At that point, however, the data is still protected at the cluster level, across nodes.

Capacity Scale Up

The Lightbits storage cluster supports dynamically expanding its total physical capacity as requirements grow. This helps reduce TCO by deferring hardware purchases until the capacity is actually needed.

Capacity expansion supports both scale up and scale out: scale up means adding more NVMe SSDs to existing storage servers, while scale out means adding more storage servers to increase both capacity and performance.
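
After physically adding an SSD to a storage server (scale up), the new drive should show up in the OS-level nvme list output and, once the cluster has taken it into service, in lbcli list nvme-devices for that server. Both commands are shown earlier in this article; for example:

[root@rack12-server15 ~]# nvme list
[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e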

For additional information, see Capacity Scale Up.
