Faulty NVMe Device

Description: Faulty NVMe Device
Version: 2.x
Symptoms

[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e | sort
Name     Size     NUMA ID  Serial        State    Server UUID                           Node UUID
nvme0n1  3.5 TiB  1        191723069582  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme1n1  3.5 TiB  1        1917230681E6  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme2n1  3.5 TiB  1        191723069952  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme3n1  3.5 TiB  1        191723068215  Failed   12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme4n1  3.5 TiB  1        191723066F0E  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2
nvme5n1  3.5 TiB  1        19172306A3F1  Healthy  12cf6fe4-b4b4-500d-b6e3-12faaf41d52e  ff4a7223-c131-5e8b-b499-c060afdda0f2

[root@rack12-server15 ~]# lbcli list nodes | sort
Name        UUID                                  State   NVMe endpoint     Failure domains  Local rebuild progress
server00-0  ff4a7223-c131-5e8b-b499-c060afdda0f2  Active  10.17.233.1:4420  [server00]       6
server01-0  12017599-cda2-544b-bed9-cdda9cdd80b6  Active  10.17.233.2:4420  [server01]       None
server02-0  99de8bed-4dbd-5aca-956f-ecac6dca7878  Active  10.17.233.3:4420  [server02]       None

Troubleshooting Steps

The nvme list command output will show that the device is missing.

[root@rack12-server15 ~]# nvme list
Node           SN            Model                      Namespace  Usage              Format       FW Rev
-------------  ------------  -------------------------  ---------  -----------------  -----------  --------
/dev/nvme0n1   191723069582  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB  512 B + 0 B  11300DN0
/dev/nvme1n1   1917230681E6  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB  512 B + 0 B  11300DN0
/dev/nvme2n1   191723069952  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB  512 B + 0 B  11300DN0
/dev/nvme4n1   191723066F0E  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB  512 B + 0 B  11300DN0
/dev/nvme5n1   19172306A3F1  Micron_9300_MTFDHAL3T8TDP  1          3.84 TB / 3.84 TB  512 B + 0 B  11300DN0

In the example above, you can see that nvme3n1 is missing.
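
As a supplementary check (not part of the original procedure), you can confirm the failure from the server itself. The lbcli filter below reuses the server UUID from the example output, and the kernel-log check is a generic Linux step rather than a Lightbits-specific one; adjust both to your environment.

# Show only the devices that the cluster reports as Failed on this server
[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e | grep -i failed

# Look for recent NVMe controller resets or I/O errors in the kernel log
[root@rack12-server15 ~]# dmesg -T | grep -i nvme | tail -n 20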

Root Cause

Failed NVMe device.

Data SSD Failure Handling

Lightbits storage handles NVMe SSD failures with its Elastic RAID capability, which protects the data stored in the Lightbits storage instance with an N+1 erasure coding mechanism (where N+1 is the total number of drives in the instance).

If one SSD fails or is removed, this feature ensures that the storage service can continue.

In parallel, the Lightbits storage instance starts a “local rebuild” to bring the data back to full “N’+1” protection, where N’ is now N-1 because one drive is gone. In other words, after a drive fails or is removed, the instance re-protects its data to ((N-1) + 1).
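
You can track this re-protection through the Local rebuild progress column of lbcli list nodes, the same command used in the Symptoms section above. A minimal way to watch it, assuming the standard watch utility is installed on the server:

# Re-run the node listing every 5 seconds and watch the Local rebuild progress column
[root@rack12-server15 ~]# watch -n 5 "lbcli list nodes"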

This feature can be enabled or disabled during installation. Note also that after a replacement drive is properly added, the instance re-protects back to N+1; the local rebuild progress seen in the node listing above is that re-protection at work.

If another drive fails after the rebuild completes, the instance rebuilds again to (N-2) + 1. Usable capacity shrinks with each failure or removal and its re-protection, so make sure the instance is not running at or near its usable capacity. Note also that erasure coding works only with eight or more drives.
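
To make the capacity impact concrete, here is a rough sketch based on the six 3.5 TiB drives in the example server. It assumes roughly one drive's worth of parity overhead and ignores metadata, over-provisioning, and data reduction, so treat the numbers as illustrative only, not as the exact usable capacity Lightbits reports.

# Illustrative arithmetic only: approximate usable capacity before and after one drive failure,
# assuming one drive's worth of N+1 parity overhead.
DRIVES=6; DRIVE_TIB=3.5
echo "healthy, N+1 = $DRIVES drives:          ~$(echo "($DRIVES-1)*$DRIVE_TIB" | bc) TiB usable"
echo "after one failure, (N-1)+1 = 5 drives: ~$(echo "($DRIVES-2)*$DRIVE_TIB" | bc) TiB usable"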

For additional information, see SSD Failure Handling.

If another SSD in the same Lightbits storage instance fails during the local rebuild, the instance becomes inactive. Even then, however, the data is still protected at the cluster level by replicas on other nodes.

Capacity Scale-Up

The Lightbits storage cluster supports dynamically expanding its total physical capacity as requirements grow. This is important for reducing TCO, because hardware purchases can be delayed until the capacity is actually needed.

Capacity expansion supports both scale-up and scale-out. Scale-up refers to adding more NVMe SSDs to existing storage servers, while scale-out refers to adding more storage servers to grow both capacity and performance.
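
After physically adding an SSD to a server (scale-up), you can confirm that both the operating system and the cluster see the new device by re-running the commands already used in this article; the server UUID below is the one from the example output above.

# Confirm the operating system enumerates the new drive
[root@rack12-server15 ~]# nvme list

# Confirm the cluster has picked it up and reports it as Healthy
[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e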

For additional information, see Capacity Scale Up.

Journal SSD Failure Handling

If one of the NVMe devices used for journaling fails, the node is marked as unhealthy and moves to a “permanently failed” state. This automatically triggers the fail-in-place capability, and the cluster re-creates the 2x and 3x replicas on other healthy nodes. The NVMe device is also reported as unhealthy by the get/list nvme-devices API. If such a failure occurs, contact Lightbits Support for additional assistance.

In a dual-instance deployment, a failure of a Journal NVMe device in one instance will not affect the other instance, which will continue to operate properly.

When a journaling device fails and the cluster can identify the failed disk, an NVMeDeviceFailed (JournalDeviceFailed) event is generated. If the failure is detected at the RAID level instead, a NodeJournalingDeviceUnknownFailure event is generated. Either event is raised in addition to the NodeInactive event.
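
While waiting for Lightbits Support, you can at least confirm the node and device state with the commands already shown in this article; the exact command for listing cluster events depends on your lbcli version, so it is not shown here.

# List nodes and look for any that are not Active (for example, the "permanently failed" state described above)
[root@rack12-server15 ~]# lbcli list nodes

# Show any devices the cluster does not report as Healthy (server UUID from the example above)
[root@rack12-server15 ~]# lbcli list nvme-devices --server-uuid 12cf6fe4-b4b4-500d-b6e3-12faaf41d52e | grep -vi healthy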
