| Description | Faulty NVMe Device | Version: 2.x |
| --- | --- | --- |
| Symptoms | | |
| Troubleshooting Steps | The `nvme list` command output will show that the device is missing. In the example below, `nvme3n1` is missing. | |
| Root Cause | Failed NVMe device. | |
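For reference, the listing below is an illustrative `nvme list` output (the serial numbers, models, and sizes are hypothetical). Note that `nvme3n1` is absent while `nvme0n1` through `nvme2n1` and `nvme4n1` are present, which indicates the failed device:

```
$ nvme list
Node          SN               Model                     Namespace Usage                   Format        FW Rev
------------- ---------------- ------------------------- --------- ----------------------- ------------- --------
/dev/nvme0n1  S4XXAA0M100001   SAMSUNG MZQL23T8HCLS-00A  1          3.84 TB /  3.84 TB     512 B + 0 B   GDC5302Q
/dev/nvme1n1  S4XXAA0M100002   SAMSUNG MZQL23T8HCLS-00A  1          3.84 TB /  3.84 TB     512 B + 0 B   GDC5302Q
/dev/nvme2n1  S4XXAA0M100003   SAMSUNG MZQL23T8HCLS-00A  1          3.84 TB /  3.84 TB     512 B + 0 B   GDC5302Q
/dev/nvme4n1  S4XXAA0M100005   SAMSUNG MZQL23T8HCLS-00A  1          3.84 TB /  3.84 TB     512 B + 0 B   GDC5302Q
```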
Data SSD Failure Handling
Lightbits storage handles NVMe SSD failures with its Elastic RAID capability, protecting the data stored in the Lightbits storage instance with an N+1 erasure coding mechanism (where N+1 is the total number of drives in the instance).
If one SSD fails or is removed, this feature ensures that the storage service can continue.
In parallel, the Lightbits storage instance starts a “local rebuild” to bring the data back to full protection. Because one drive has failed or been removed, the instance now has N-1 data drives, so it re-protects to (N-1)+1.
This feature can be enabled or disabled during installation. Once a replacement drive is properly added, the instance re-protects back to N+1; the rebuild activity you observe is this re-protection.
If another drive fails after the rebuild completes, the instance rebuilds again to (N-2)+1. Usable capacity decreases with each failure/removal and re-protection, so make sure the cluster is not running close to full capacity. Also note that erasure coding requires eight or more drives.
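As a rough illustration of the capacity impact, the sketch below assumes ten 3.84 TB drives per instance and treats the parity overhead as exactly one drive's worth of capacity; the drive count, drive size, and that simplification are all hypothetical:

```bash
# Illustrative only: usable capacity under N+1 erasure coding,
# where one drive's worth of capacity holds parity. Requires bc.
DRIVES=10          # total drives in the instance (hypothetical)
SIZE_TB=3.84       # capacity per drive in TB (hypothetical)

# Healthy: N data drives + 1 parity -> usable = (DRIVES - 1) * SIZE_TB
echo "healthy:            $(echo "($DRIVES - 1) * $SIZE_TB" | bc) TB usable"

# After one drive fails and the local rebuild re-protects to (N-1)+1:
echo "after one failure:  $(echo "($DRIVES - 2) * $SIZE_TB" | bc) TB usable"

# After a second failure and rebuild, (N-2)+1:
echo "after two failures: $(echo "($DRIVES - 3) * $SIZE_TB" | bc) TB usable"
```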
For additional information, see SSD Failure Handling.
If another SSD in the same Lightbits storage instance fails during the local rebuild, the instance becomes inactive. At that point, however, the data is still protected by replication across nodes.
Capacity Scale-Up
The Lightbits storage cluster supports dynamically expanding its total physical capacity as requirements grow. This is important for reducing TCO, because hardware purchases can be deferred until the capacity is actually needed.
Capacity expansion supports both scale-up and scale-out. Scale-up refers to adding more NVMe SSDs to existing storage servers, while scale-out refers to adding more storage servers, which increases both capacity and performance. A quick way to verify either kind of expansion is shown below.
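This is a sketch only; the exact expansion workflow depends on your Lightbits version, so consult the Capacity Scale Up documentation. The commands assume the lbcli management client:

```bash
# Scale-up: on the storage server, confirm the OS sees the new SSDs.
nvme list

# Scale-out: from a management host, confirm the new server joined
# the cluster (output columns vary by Lightbits version).
lbcli list nodes
```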
For additional information, see Capacity Scale Up.
Journal SSD Failure Handling
If one of the NVMe devices used for journaling fails, the node will be marked as unhealthy and will transition to the “permanently failed” state. This automatically triggers the fail-in-place capability, and the cluster will re-create the 2x and 3x replicas on other healthy nodes. The NVMe device will also be marked as unhealthy in the output of the get/list nvme-devices API. In case of such a failure, contact Lightbits Support for additional assistance.
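As a quick check after such a failure, the device and node states can be inspected from a management host. This sketch assumes the lbcli client that fronts the get/list nvme-devices API mentioned above; output columns vary by version:

```bash
# The failed journal device should be reported as unhealthy.
lbcli list nvme-devices

# The affected node should be reported as permanently failed.
lbcli list nodes
```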
In a dual-instance deployment, a failure of a Journal NVMe device in one instance will not affect the other instance, which will continue to operate properly.
When a journaling device fails and the cluster can identify the failed disk, an NVMeDeviceFailed (JournalDeviceFailed) event is generated. Otherwise, if it is a RAID failure, a NodeJournalingDeviceUnknownFailure event is generated. Both are generated in addition to the NodeInactive event.
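If your lbcli release exposes an event listing (an assumption; the subcommand may differ across versions), these events could be filtered along these lines:

```bash
# Sketch: filter cluster events for journal-failure indicators.
# Assumes `lbcli list events` exists in your release; the grep pattern
# simply matches the event names described above.
lbcli list events | grep -E 'NVMeDeviceFailed|JournalDeviceFailed|NodeJournalingDeviceUnknownFailure|NodeInactive'
```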