Dynamic Rebalancing on Server Failure

Dynamic rebalancing on server failure gives the storage cluster self-healing behavior. When a server fails, the cluster recovers the impacted volumes from the “Degraded” state by dynamically moving the volume replicas from the failed nodes to other healthy nodes, while preserving failure domain requirements.

This feature can be enabled or disabled through the “fail-in-place” setting, as the user requires (it is enabled by default). The time from the node failure until the cluster starts recovering volumes is determined by a cluster configuration parameter called “DurationToTurnIntoPermanentFailure”.

Test Purpose

The purpose of this test is to show that the feature works as expected: the impacted volumes “self-heal” and return to the “FullyProtected” state after a period of time. The length of that period depends on the value of --parameter=DurationToTurnIntoPermanentFailure.
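The wait for recovery can be scripted as a polling loop. The following is a minimal sketch, not a Lightbits tool: `wait_for_state` is a hypothetical helper, and the command passed to it (shown as `my_volume_state` in the comment) is a placeholder for whatever prints a volume's protection state in your environment. The timeout should comfortably exceed DurationToTurnIntoPermanentFailure plus the expected rebalancing time.

```bash
#!/usr/bin/env bash
# wait_for_state WANT TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Runs CMD repeatedly until it prints WANT, or fails once TIMEOUT_S has elapsed.
# CMD is a placeholder (assumption): any command that prints the volume's
# current protection state, such as "Degraded" or "FullyProtected".
wait_for_state() {
  local want="$1" timeout_s="$2" interval_s="$3"
  shift 3
  local deadline state
  deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    state="$("$@")"
    if [ "$state" = "$want" ]; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for $want; last state: $state" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}

# Example (my_volume_state is hypothetical and must be supplied by you):
#   wait_for_state FullyProtected 3600 30 my_volume_state vol1
```

Passing the state command as an argument keeps the loop independent of any particular CLI, so the same helper can be reused for node status checks.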

Test Steps

  1. In the Lightbits storage cluster, check the “fail-in-place” setting and the “DurationToTurnIntoPermanentFailure” parameter. To shorten the test waiting time, this parameter can be changed to a smaller duration.
  2. Create a few two-replica volumes. (Note that a test with three-replica volumes requires at least four storage servers.)
  3. On the client side, check these volumes and their multipath information, then use FIO to generate IO traffic to them.
  4. Shut down one server to simulate a server failure. Then, from a healthy node, check the status of all nodes in the cluster and the status of the volumes.
  5. Wait for the recovery to complete. The total time taken is DurationToTurnIntoPermanentFailure (the value of --parameter=DurationToTurnIntoPermanentFailure) plus the replica rebalancing time, which depends on the volumes’ used physical capacity. The volumes should then be in the “FullyProtected” state again.
  6. On the client, use “nvme list-subsys” to check the volume path information. The impacted volumes should now have a new path.
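The client-side parts of the steps above can be sketched as follows. This is an illustration under assumptions, not the original commands from this guide: the helper names (`run_io`, `snapshot_paths`, `paths_changed`), the device path, and the fio job settings are all hypothetical choices. Only `nvme list-subsys` is taken from the text. The idea is to save the path listing before the failure and compare it with the listing after recovery.

```bash
#!/usr/bin/env bash
# Run mixed random IO against one block device for a fixed time.
# WARNING: this writes to the raw device and destroys its data.
# The device path is an assumption; use the device that maps to a test volume.
run_io() {
  local dev="$1" runtime_s="$2"
  fio --name=rebalance-test --filename="$dev" --rw=randrw --bs=4k \
      --ioengine=libaio --direct=1 --iodepth=16 \
      --time_based --runtime="$runtime_s" --group_reporting
}

# Save the current NVMe subsystem and path listing to a file.
snapshot_paths() {
  nvme list-subsys > "$1"
}

# Print the differences between two snapshots.
# Returns 0 if the listings differ, 1 if they are identical.
paths_changed() {
  ! diff -u "$1" "$2"
}

# Example flow (paths are placeholders):
#   snapshot_paths /tmp/paths.before
#   run_io /dev/nvme0n1 600 &
#   ... shut down one server and wait for FullyProtected ...
#   snapshot_paths /tmp/paths.after
#   paths_changed /tmp/paths.before /tmp/paths.after
```

Comparing full snapshots avoids parsing the `nvme list-subsys` output, whose layout can vary between nvme-cli versions.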