Lightbits Elastic RAID Implementation

Lightbits storage handles NVMe SSD failures with its Elastic RAID capability, which protects the data stored in a Lightbits storage instance using an N+1 erasure coding (EC) mechanism, where N+1 is the total number of drives in the instance.

If one SSD fails or is removed, this feature ensures that the storage service continues.

In parallel, the Lightbits storage instance starts a "local rebuild" so that the data is protected again under a new N'+1 scheme. Here N' is N-1, because one drive is gone. In other words, after a drive fails or is removed, the instance reprotects to (N-1)+1.

This feature can be enabled or disabled during installation. Note also that once a drive has been added back properly, the instance reprotects to N+1. The rebuild activity you see is this reprotection taking place.

If another drive fails after the rebuild completes, the instance rebuilds again to (N-2)+1. Usable capacity shrinks with each reprotection after a drive failure or removal, so make sure the instance is not already at its usage capacity. Additionally, EC requires 8 or more drives.

For example, with 12 drives the scheme is 11+1. If a drive fails, it drops to 10+1; if another drive fails, it drops to 9+1. This can continue as long as there is enough free space and the total number of drives does not drop below 8. Conversely, if a drive is added at 10+1, the instance reprotects back up to 11+1.
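
The drive-count arithmetic above can be sketched in a few lines of Python. This is only an illustration, not Lightbits code: the names `MIN_DRIVES`, `Layout`, `layout_for`, and `can_reprotect` are hypothetical, the minimum of 8 drives is taken from the text above, and the capacity check assumes equally sized drives with exactly one drive's worth of space consumed by parity, which is a simplification.

```python
from dataclasses import dataclass

# From the text: EC works with 8 or more drives in total.
MIN_DRIVES = 8


@dataclass(frozen=True)
class Layout:
    """An N+1 protection scheme: N data drives plus 1 parity drive."""
    data: int
    parity: int = 1

    @property
    def total(self) -> int:
        return self.data + self.parity

    def __str__(self) -> str:
        return f"{self.data}+{self.parity}"


def layout_for(healthy_drives: int) -> Layout:
    """Return the N+1 layout for the given number of healthy drives."""
    if healthy_drives < MIN_DRIVES:
        raise ValueError(
            f"EC needs at least {MIN_DRIVES} drives, got {healthy_drives}"
        )
    return Layout(data=healthy_drives - 1)


def can_reprotect(healthy_drives: int, drive_capacity_tb: float, used_tb: float) -> bool:
    """Estimate whether the instance could reprotect after losing one more drive.

    Assumption (not from Lightbits documentation): all drives are the same
    size and usable capacity is (drives - 1) * drive size.
    """
    remaining = healthy_drives - 1
    if remaining < MIN_DRIVES:
        return False
    usable_tb = (remaining - 1) * drive_capacity_tb
    return used_tb <= usable_tb


if __name__ == "__main__":
    for drives in (12, 11, 10):
        print(drives, "drives ->", layout_for(drives))
    print("12 drives, 4 TB each, 38 TB used:", can_reprotect(12, 4.0, 38.0))
```

The capacity check is the practical point: an instance that is nearly full at 11+1 may not have room to reprotect at 10+1, which is why the usage level matters before a drive is lost.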

If another SSD in the same Lightbits storage instance fails while the local rebuild is in progress, that storage instance becomes inactive. At that point, however, the data is still protected by replication across nodes.

While the local rebuild is in progress, the instance is vulnerable to further SSD failures. Replication still applies, so if a second NVMe drive fails while the rebuild from the first failure is still running, Lightbits considers the node failed, and clients fail over and continue using replicas on the other nodes.

The node is then considered permanently failed, and the affected volumes become degraded. We then trigger our self-healing (rebalancing) mechanism to create new replicas on the remaining nodes. Eventually, the volumes return to being fully protected.
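
The failure handling described above can be summarized as a small state machine. The following Python sketch is a simplified model for illustration only: the state names, the event strings, and the functions `on_drive_failure`, `on_rebuild_complete`, and `replay` are hypothetical and do not correspond to any Lightbits API. It also ignores the drive-count and capacity limits discussed earlier.

```python
from enum import Enum
from typing import Iterable


class InstanceState(Enum):
    PROTECTED = "protected"    # data is N+1 protected
    REBUILDING = "rebuilding"  # local rebuild running after a drive loss
    INACTIVE = "inactive"      # instance down; clients use replicas on other nodes


def on_drive_failure(state: InstanceState) -> InstanceState:
    """A drive fails or is removed."""
    if state is InstanceState.PROTECTED:
        # First failure: service continues and a local rebuild starts.
        return InstanceState.REBUILDING
    # A second failure during the rebuild makes the instance inactive.
    # An already inactive instance stays inactive.
    return InstanceState.INACTIVE


def on_rebuild_complete(state: InstanceState) -> InstanceState:
    """The local rebuild finishes."""
    if state is InstanceState.REBUILDING:
        return InstanceState.PROTECTED
    return state


def replay(events: Iterable[str]) -> InstanceState:
    """Apply a sequence of 'fail' / 'rebuilt' events to a healthy instance."""
    state = InstanceState.PROTECTED
    for event in events:
        if event == "fail":
            state = on_drive_failure(state)
        elif event == "rebuilt":
            state = on_rebuild_complete(state)
        else:
            raise ValueError(f"unknown event: {event}")
    return state


if __name__ == "__main__":
    # Two failures with a completed rebuild in between: still protected.
    print(replay(["fail", "rebuilt", "fail", "rebuilt"]))
    # Second failure while the first rebuild is still running: inactive.
    print(replay(["fail", "fail"]))
```

In this model, an inactive instance never returns to service on its own, matching the description above in which the node is treated as permanently failed and protection is restored by creating new replicas on other nodes.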
