SSD Failure Handling

Lightbits storage handles NVMe SSD failures with its Elastic RAID capability, which protects the data stored in a Lightbits storage instance with an N+1 erasure coding (EC) mechanism (N+1 being the total number of drives in the instance).

If one SSD fails or is removed, this feature ensures that the storage service can continue.

In parallel, the Lightbits storage instance starts a “local rebuild” to bring the data back to full “N’+1” protection, where N’ = N-1 because one drive is now missing. In other words, after a drive fails or is removed, the instance re-protects the data across the remaining drives as (N-1)+1.

This feature can be enabled or disabled during installation. Once a replacement drive is properly added back, the instance re-protects to N+1 again; the rebuild activity observed in this test is exactly that re-protection.

If another drive fails after the rebuild completes, the instance rebuilds again to (N-2)+1. For example, an instance with 12 drives starts at 11+1; after one drive fails it re-protects to 10+1, and after a second failure to 9+1. Usable capacity shrinks with each failure/removal and re-protection cycle, so make sure the instance is not near full capacity before running this test. Also note that erasure coding requires eight or more drives.

If another SSD in the same Lightbits storage instance fails while the local rebuild is still in progress, that storage instance becomes inactive. However, the data is still protected at the cluster level by replication across nodes.

Test Purpose

The purpose of this test is to prove that this feature works as expected using manual commands. It uses a Linux command to remove one specific NVMe SSD from the PCIe bus to simulate an SSD failure or removal, checks whether IO continues, and then tracks the “local rebuild” progress.

Test Steps

  1. Create a volume for one specific client, then from the client side check the multipath information of this volume to find the location of its primary replica. Then use FIO to generate a continuous IO load against the volume, as in the sketch below.
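A minimal sketch of this step, assuming a volume named testvol, a client-side namespace /dev/nvme1n1, and an existing NVMe/TCP connection to the cluster; the lbcli flags, device names, and FIO parameters shown are illustrative and may differ in your environment:

```bash
# Create a test volume (flags are illustrative; check your lbcli version)
lbcli create volume --name=testvol --size=100GiB --acl=ACL1 --replica-count=3

# On the client, inspect the multipath layout of the volume's namespace
# to identify which storage server holds the primary (optimized) path
nvme list-subsys /dev/nvme1n1

# Generate a continuous IO load against the volume
fio --name=ssd-fail-test --filename=/dev/nvme1n1 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=3600 \
    --status-interval=30
```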
  2. On the storage server that holds the primary replica, use the following Linux commands to remove the specific NVMe device from the PCIe bus, simulating an SSD removal. Use the “nvme list” command to verify that the SSD was removed.
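A sketch of the removal, assuming the target drive is /dev/nvme0n1 at PCI address 0000:5e:00.0 (both hypothetical; look up the real address on your server first):

```bash
# Find the PCI address of the target NVMe SSD
readlink -f /sys/block/nvme0n1/device

# Remove the device from the PCIe bus to simulate an SSD failure/removal
echo 1 > /sys/bus/pci/devices/0000:5e:00.0/remove

# Confirm the SSD no longer appears
nvme list
```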
  3. Check the status of FIO. The IO should continue even after the SSD is removed.
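One way to confirm that IO keeps flowing on the client, reusing the hypothetical /dev/nvme1n1 namespace from step 1:

```bash
# FIO keeps printing periodic status lines (see --status-interval above);
# the job should continue without errors after the SSD removal.

# Independently confirm the namespace is still serving IO
iostat -x 2 nvme1n1
```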
  4. Check the Lightbits storage instance status and verify the local rebuild progress. The rebuild typically takes one to a few hours, depending on how many SSDs are installed and how much data has been written.
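A sketch of the server-side check, assuming the lbcli management CLI is available on a cluster node; the exact subcommands and the fields that expose rebuild progress vary by Lightbits release, so treat these as assumptions and consult your release's lbcli reference:

```bash
# Check the state of the storage nodes; the affected node should report
# that it is re-protecting/rebuilding rather than fully Active
lbcli list nodes

# Check the state of the NVMe devices on the affected node
# (subcommand name is an assumption)
lbcli list nvme-devices
```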
  5. Check the Grafana monitoring GUI to verify that the expected warning appears (Cluster tab). You can also monitor the local rebuild progress in the Grafana GUI.

To re-add the device to Linux, run:

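A minimal sketch of the re-add, assuming the drive was removed via the sysfs method above; a PCIe bus rescan makes the kernel rediscover it:

```bash
# Rescan the PCIe bus so the kernel re-enumerates the removed SSD
echo 1 > /sys/bus/pci/rescan

# Verify the SSD is visible again
nvme list
```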