Lightbits Elastic Raid Implementation

Lightbits storage handles SSD failure with an Elastic Raid capability, protecting the data stored in the Lightbits storage instance with an N+1 Erasure Coding mechanism (with N+1 representing the total number of drives in the instance).

If one SSD fails or is removed, this feature ensures that the storage service continues to run.

In parallel, the Lightbits storage instance starts a "local rebuild" so that the data is again protected as N'+1. Here N' equals N-1, because one drive is gone. In other words, after a drive fails or is removed, the instance reprotects to (N-1)+1.

This feature can be enabled or disabled during installation. Note also that after a drive is added properly, the instance reprotects back to N+1. The rebuild that follows the addition is what restores that protection.

If another drive fails after the rebuild completes, the instance rebuilds again to (N-2)+1. Usable capacity shrinks with each reprotection that follows a drive failure or removal, so it is important not to let the instance run close to full capacity. Additionally, erasure coding (EC) works only with eight or more drives.

For example, with 12 drives the protection scheme is 11+1. If a drive fails, it drops to 10+1. If another drive fails, it drops to 9+1. This can continue as long as there is enough free space and the total number of drives (the sum N+1) does not drop below 8. Conversely, if a drive is added while the instance is at 10+1, it reprotects back up to 11+1.
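The arithmetic in this example can be sketched in a few lines of Python. This is only an illustrative model, not Lightbits code: the names `ec_layout`, `usable_fraction`, and `replay` are hypothetical, and the usable-capacity figure assumes that exactly one drive's worth of space goes to parity and ignores any other overhead. The free-space condition mentioned above is not modeled.

```python
MIN_DRIVES = 8  # from the text: EC works with eight or more drives


def ec_layout(total_drives: int) -> str:
    """Return the 'N+1' label for an instance with this many healthy drives."""
    if total_drives < MIN_DRIVES:
        raise ValueError(f"EC needs at least {MIN_DRIVES} drives, got {total_drives}")
    return f"{total_drives - 1}+1"


def usable_fraction(total_drives: int) -> float:
    """Assumed share of raw capacity left for data: N out of N+1 drives."""
    if total_drives < MIN_DRIVES:
        raise ValueError(f"EC needs at least {MIN_DRIVES} drives, got {total_drives}")
    return (total_drives - 1) / total_drives


def replay(total_drives: int, events: list[str]) -> list[str]:
    """Apply 'fail' / 'add' events, assuming each rebuild completes
    before the next event, and record the layout after each step."""
    layouts = [ec_layout(total_drives)]
    for event in events:
        if event == "fail":
            total_drives -= 1
        elif event == "add":
            total_drives += 1
        else:
            raise ValueError(f"unknown event: {event!r}")
        layouts.append(ec_layout(total_drives))
    return layouts


if __name__ == "__main__":
    print(replay(12, ["fail", "fail"]))  # two failures in sequence
    print(replay(12, ["fail", "add"]))   # one failure, then a drive is added
    print(f"{usable_fraction(12):.3f} -> {usable_fraction(10):.3f}")
```

Under these assumptions, each failure reduces both the raw drive count and the fraction of that raw capacity available for data, which is why free space matters before a rebuild starts.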

If another SSD in the same Lightbits storage instance fails while the local rebuild is still running, that storage instance becomes inactive. At that point, however, the data is still protected at the node level.

While the local rebuild is in progress, the instance is vulnerable to further SSD failures. Replication still protects the data: if a second SSD fails while the rebuild from the first failure is still running, Lightbits considers the node failed, and clients fail over and continue using the replicas on the other nodes.

The node is then considered permanently failed, and the affected volumes become degraded. Lightbits then triggers the self-healing (re-balancing) mechanism to create new replicas on the remaining nodes. Eventually, the volumes return to being fully protected.
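The failure handling described above can be summarized as a small state model. The sketch below is hypothetical and not part of any Lightbits API: the class `StorageInstance` and its methods are invented for illustration, and it assumes that replication across nodes is configured.

```python
from dataclasses import dataclass


@dataclass
class StorageInstance:
    """Toy model of one storage instance's data SSDs (illustrative only)."""
    drives: int               # healthy data SSDs
    rebuilding: bool = False  # True while a local rebuild is running
    active: bool = True       # False once the instance is treated as failed

    def ssd_failed(self) -> str:
        """Describe what the text says happens when one more SSD fails."""
        if not self.active:
            return "instance already inactive"
        self.drives -= 1
        if self.rebuilding:
            # Second failure during a rebuild: the node is considered failed.
            self.active = False
            return ("node failed: clients fail over to replicas; "
                    "self-healing creates new replicas on other nodes")
        self.rebuilding = True
        return f"local rebuild started: reprotecting to {self.drives - 1}+1"

    def rebuild_done(self) -> None:
        """Mark the local rebuild as complete."""
        self.rebuilding = False


if __name__ == "__main__":
    inst = StorageInstance(drives=12)
    print(inst.ssd_failed())  # first failure starts a local rebuild
    print(inst.ssd_failed())  # second failure during the rebuild
```

The key distinction the model captures is timing: a second failure after `rebuild_done()` simply starts another local rebuild, while a second failure before it takes the whole instance out of service.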

This section applies only to data SSD devices; Journaling SSD devices are handled separately.
