Faulty NVMe Device
| Field | Details |
|---|---|
| Description | Faulty NVMe Device |
| Version | 2.x |
| Symptoms | |
| Troubleshooting Steps | The `nvme list` command output shows that the device is missing. In the example above, `nvme3n1` is missing. |
| Root Cause | Failed NVMe device. |
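The check described in the troubleshooting step can be automated by comparing the namespaces you expect (for example, those listed while the server was healthy) against the current `nvme list` output. The sketch below assumes the default table output, where the first column of each device row is the node path such as `/dev/nvme0n1`. The function names and the sample output are hypothetical and for illustration only.

```python
import re

# Matches a namespace node path such as /dev/nvme0n1 (assumed first column of `nvme list`).
NODE_PATTERN = re.compile(r"/dev/(nvme\d+n\d+)")


def listed_devices(nvme_list_output: str) -> set[str]:
    """Return namespace names (e.g. 'nvme0n1') found in default `nvme list` table output."""
    devices = set()
    for line in nvme_list_output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        match = NODE_PATTERN.fullmatch(fields[0])
        if match:  # header and separator rows do not match and are ignored
            devices.add(match.group(1))
    return devices


def missing_devices(expected: list[str], nvme_list_output: str) -> list[str]:
    """Return expected namespaces absent from the output, preserving the expected order."""
    present = listed_devices(nvme_list_output)
    return [name for name in expected if name not in present]


# Hypothetical, abbreviated output with one device absent.
SAMPLE_OUTPUT = """\
Node             SN        Model
---------------- --------- ----------
/dev/nvme0n1     SN000     ExampleSSD
/dev/nvme1n1     SN001     ExampleSSD
/dev/nvme2n1     SN002     ExampleSSD
"""

if __name__ == "__main__":
    expected = ["nvme0n1", "nvme1n1", "nvme2n1", "nvme3n1"]
    print(missing_devices(expected, SAMPLE_OUTPUT))  # ['nvme3n1']
```

Preserving the expected order (rather than sorting) avoids lexicographic surprises such as `nvme10n1` sorting before `nvme2n1`.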
SSD Failure Handling
Lightbits storage handles NVMe SSD failures with its Elastic RAID capability, which protects the data stored in a Lightbits storage instance with an N+1 erasure coding (EC) mechanism, where N+1 is the total number of drives in the instance.
If one SSD fails or is removed, this feature allows the storage service to continue.
In parallel, the Lightbits storage instance starts a local rebuild so that the data is again protected as N'+1. Because one drive is gone, N' is N-1; in other words, after a drive fails or is removed, the instance reprotects to (N-1)+1.
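The reprotection arithmetic can be sketched as a small function. This is a simplified model, assuming a single parity share and that every remaining drive participates in the new layout; the function name is hypothetical and is not part of any Lightbits API.

```python
def protection_layout(total_drives: int, failed_drives: int = 0) -> tuple[int, int]:
    """Return (data, parity) shares of the N+1 layout after reprotecting around failures.

    total_drives is N+1 for the healthy instance; each failure or removal reduces N by one.
    Simplified model: exactly one parity share, all remaining drives used.
    """
    remaining = total_drives - failed_drives
    if remaining < 2:
        raise ValueError("an N+1 layout needs at least two drives")
    return remaining - 1, 1


if __name__ == "__main__":
    # A 12-drive instance: 11+1 when healthy, 10+1 after one loss, 9+1 after two.
    for failed in range(3):
        data, parity = protection_layout(12, failed)
        print(f"{failed} failed -> {data}+{parity}")
```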
This feature can be enabled or disabled during installation. Note also that after a replacement drive is properly added, the instance reprotects back to N+1. The rebuild that is observed in this situation is this reprotection.
If another drive fails after the rebuild completes, the instance rebuilds again to (N-2)+1. Usable capacity decreases with each reprotection after a drive failure or removal, so make sure the instance is not close to full capacity. Additionally, EC requires eight or more drives.
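Because each reprotection removes one data share, it helps to check ahead of time whether the data currently stored would still fit after a drive is lost. The sketch below uses a deliberately rough model, assuming usable capacity is N data shares of one drive's size and ignoring metadata, over-provisioning, and compression; the function names are hypothetical, and the eight-drive minimum is taken from this article.

```python
MIN_EC_DRIVES = 8  # minimum drive count for EC, as stated in this article


def usable_capacity_tb(total_drives: int, drive_tb: float, failed_drives: int = 0) -> float:
    """Approximate usable capacity (TB) of an N+1 layout after `failed_drives` reprotections.

    Rough model: (remaining drives - 1) data shares of drive_tb each; ignores metadata,
    over-provisioning, and compression.
    """
    if total_drives < MIN_EC_DRIVES:
        raise ValueError(f"EC requires at least {MIN_EC_DRIVES} drives")
    remaining = total_drives - failed_drives
    if remaining < 2:
        raise ValueError("an N+1 layout needs at least two drives")
    return (remaining - 1) * drive_tb


def fits_after_failure(total_drives: int, drive_tb: float, used_tb: float,
                       failed_drives: int = 1) -> bool:
    """True if used_tb still fits once the instance has reprotected around the failures."""
    return used_tb <= usable_capacity_tb(total_drives, drive_tb, failed_drives)


if __name__ == "__main__":
    # 12 drives x 4 TB: about 44 TB usable when healthy, about 40 TB after one reprotection.
    print(usable_capacity_tb(12, 4.0), usable_capacity_tb(12, 4.0, 1))
    print(fits_after_failure(12, 4.0, used_tb=42.0))  # 42 TB would not fit in ~40 TB
```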
For additional information, see SSD Failure Handling.
If another SSD in the same Lightbits storage instance fails during the local rebuild, the instance becomes inactive. However, at that point, protection is provided at the node level.
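The instance-level behavior described in this section can be summarized as a small state machine: a drive loss starts a rebuild, a completed rebuild restores protection, adding a drive triggers reprotection, and a second loss during a rebuild makes the instance inactive. The state and event names below are hypothetical and do not correspond to any Lightbits API or status output; node-level protection and recovery from the inactive state are outside this sketch.

```python
from enum import Enum


class InstanceState(Enum):
    HEALTHY = "healthy"        # fully protected N+1 layout
    REBUILDING = "rebuilding"  # local rebuild / reprotection in progress
    INACTIVE = "inactive"      # another drive failed during the rebuild


LOSS_EVENTS = {"drive_failed", "drive_removed"}


def next_state(state: InstanceState, event: str) -> InstanceState:
    """Hypothetical transition function for a single storage instance."""
    if state is InstanceState.INACTIVE:
        return state  # recovery is not modeled here
    if event in LOSS_EVENTS:
        # A loss while already rebuilding takes the instance offline.
        if state is InstanceState.REBUILDING:
            return InstanceState.INACTIVE
        return InstanceState.REBUILDING
    if event == "drive_added":
        return InstanceState.REBUILDING  # reprotect back toward N+1
    if event == "rebuild_done":
        return InstanceState.HEALTHY if state is InstanceState.REBUILDING else state
    raise ValueError(f"unknown event: {event}")


def replay(events: list[str], state: InstanceState = InstanceState.HEALTHY) -> InstanceState:
    """Apply a sequence of events starting from `state` and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state


if __name__ == "__main__":
    print(replay(["drive_failed", "rebuild_done"]))   # InstanceState.HEALTHY
    print(replay(["drive_failed", "drive_failed"]))   # InstanceState.INACTIVE
```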
Capacity Scale Up
The Lightbits storage cluster supports dynamically expanding total physical capacity as requirements grow. This helps reduce total cost of ownership (TCO) by deferring hardware purchases until the capacity is needed.
Capacity can be expanded by scaling up or scaling out. Scaling up means adding more NVMe SSDs to existing storage servers, while scaling out means adding more storage servers to increase both capacity and performance.
For additional information, see Capacity Scale Up.