Server Failure Handling

Lightbits storage handles server failures with volume replication and the Asymmetric Namespace Access (ANA) mechanism defined in the NVMe over Fabrics standard.

With this mechanism, the client has multiple paths to the different replicas of a volume residing on the storage servers. When the Lightbits storage server holding the primary replica fails (or its network is disconnected), the client automatically switches the primary path to a new path and continues I/O. Path switchover typically takes less than 10 seconds (expect a brief I/O hiccup, followed by quick recovery). Volumes that have a replica on the failed server change to “Degraded” status.

Once the failed server rejoins the cluster, a “Volume Rebuild” starts on the impacted volumes: the data is synced to the affected replicas and the volumes recover to “Healthy” status.

Test Purpose

The purpose of this test is to prove that this feature works as expected using manual commands. A Linux command will be used to power off the Lightbits server holding the primary replica, after which we check whether I/O recovers quickly. We also check that the “volume rebuild” happens as expected after the failed server comes back.

Test Steps

  1. Use the lbcli command to create one volume and bind it to the client server. For more, see Creating a Volume on the Lightbits Storage Server.
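For example (the volume name, size, and ACL below are placeholders, and exact flags may vary by lbcli release):

```bash
# Illustrative only: create a 100 GiB volume with three replicas,
# accessible to the client(s) matching acl1.
lbcli create volume --name=test-vol --size="100 GiB" --acl=acl1 --replica-count=3
```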
  2. On Client A, use the “nvme list” command to confirm the new device, and “nvme list-subsys” to check the multipath state of the newly created volume.
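For example (the device name below is a placeholder; substitute the device reported by nvme list):

```bash
# List NVMe devices visible on the client.
nvme list

# Show all paths to the subsystem backing the volume and their ANA states.
nvme list-subsys /dev/nvme0n1
```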

The “live optimized” path leads to the server holding the primary replica, while the “live inaccessible” paths lead to the servers holding the secondary replicas. This volume has three replicas in total: one primary path and two secondary paths.

  3. Use FIO to generate a continuous I/O load on this volume.

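For example, a sustained random-write load (the device name is a placeholder; these are standard fio options):

```bash
# Sustained 4K random writes for 10 minutes against the test volume.
fio --name=failover-test --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=600
```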
  4. Power off the Lightbits storage server holding the primary replica to simulate a server failure. Monitor the FIO status and check whether it returns to normal after a short period (approximately 10 seconds). Use “nvme list-subsys /dev/nvme” to check the optimized path switching.
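For example (the device name is a placeholder):

```bash
# Re-check the ANA state of each path after the failover.
nvme list-subsys /dev/nvme0n1
```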

Expect the “live optimized” path to switch to a storage server that is still working, while the original primary path changes to “connecting optimized”.

  5. Use “lbcli list nodes” to view the status of the remaining storage servers in the cluster, and “lbcli list volumes” to view the status of the affected volume.
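For example:

```bash
# Node status: the powered-off server should report as Inactive.
lbcli list nodes

# Volume status: the test volume should report as Degraded.
lbcli list volumes
```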

The powered-off server shows as “Inactive” and the volume created for this test shows as “Degraded”. Other volumes may be impacted too, depending on where their replicas are located.

  6. Power the server back up. It will rejoin the cluster automatically. Use “lbcli list nodes” to view the cluster status, and “lbcli get volume” to monitor the rebuild progress of the impacted volume; it should eventually return to “FullyProtected”.
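For example (the volume name and the --name flag are illustrative and may vary by lbcli release):

```bash
# Poll the volume every 5 seconds until its protection state
# returns to FullyProtected.
watch -n 5 'lbcli get volume --name=test-vol'
```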
  7. Note that the server downtime should be shorter than “DurationToTurnIntoPermanentFailure” if “fail-in-place” is enabled for dynamic rebalancing (the default is one hour). If downtime exceeds this threshold, the server failure is treated as a permanent failure, and the impacted replicas are rebalanced to other working nodes.