Dynamic Rebalance

In dynamic rebalance, the cluster tries to move volumes from one node to other nodes. Two distinct mechanisms are involved: proactive rebalance and fail in place. Both are enabled by default, and disabling them is not recommended.

Fail in Place

When fail in place mode activates, the Lightbits cluster tries to move the replicas of volumes from failed nodes to other healthy nodes, while preserving failure domain requirements. In Lightbits SDS on AWS, fail in place is triggered by the auto maintenance process. See Auto Healing Overview for more information. Fail in place can be disabled if required (this is not recommended unless instructed by Lightbits Support).

Fail in place, and the recovery of the data that was on the failed server, is triggered when the cluster manager marks the node as “permanently failed”. The cluster does this only after the node has been unavailable for a certain duration, to make sure that the failure is permanent rather than transient.

This duration runs from the moment the node failure is detected to the point at which the cluster manager marks the node as “permanently failed”. It can be set through a cluster configuration parameter called DurationToTurnIntoPermanentFailure.
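As a minimal sketch of this decision, with illustrative numbers (the variable names and values below are assumptions for illustration, not Lightbits internals):

```bash
# Sketch: a node is marked "permanently failed" only after it has been
# unavailable for longer than the configured duration. Values are illustrative.
duration_to_permanent_failure=600   # DurationToTurnIntoPermanentFailure, in seconds
node_down_seconds=750               # how long the node has been unreachable

if [ "$node_down_seconds" -ge "$duration_to_permanent_failure" ]; then
  echo "mark node permanently failed: fail in place starts"
else
  echo "failure may be transient: keep waiting"
fi
```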

Fail in Place CLI Example:

Fail in place mode is activated or deactivated by enabling or disabling its feature flag.

For example, to enable/disable a cluster feature flag with a given feature flag name:

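A sketch of what this can look like with the lbcli management tool; the exact subcommand form and the feature flag name shown here are assumptions and may differ between Lightbits releases, so verify them against your release's CLI reference:

```bash
# Hypothetical lbcli invocations - the subcommand syntax and the flag name
# "fail-in-place" are assumptions; check `lbcli --help` on your cluster.
lbcli enable feature-flag fail-in-place    # turn fail in place on (the default)
lbcli disable feature-flag fail-in-place   # turn fail in place off (not recommended)
```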

The following is an example of how to set the duration after which a failed node triggers cluster rebalance.

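A sketch of setting this duration with lbcli; only the DurationToTurnIntoPermanentFailure name comes from the text above, while the subcommand and flag spelling are assumptions to be checked against your release's CLI reference:

```bash
# Hypothetical lbcli invocation - the subcommand and flag spelling are
# assumptions; only the configuration name comes from the documentation.
lbcli update cluster-config --DurationToTurnIntoPermanentFailure=30m
```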

Proactive Rebalance

The proactive rebalance feature enables the cluster to automatically balance volumes between nodes based on capacity. This prevents scenarios where one storage node exceeds its capacity and reaches read-only status while other nodes still have space available to serve more capacity.

By default this feature is enabled.

During proactive rebalance of a volume, the protection state of a volume is kept according to the state of nodes in the cluster. If all nodes are available, the volume will also be fully protected during the rebalance process.

When a volume migrates from a source node, the cluster first creates a temporary replica of the volume. Once all data has been synced to the temporary replica, the replica on the source node is removed and the cluster selects a new primary node.

The following section describes when proactive rebalance is activated. All decisions are based on how far away the nodes are from the “read-only” state.

There are two primary indications for read-only state:

  • Storage effective capacity
  • Available RAM for metadata

There are two primary reasons that the cluster will trigger a proactive rebalance:

  1. A node is getting close to the read-only state:

     a. The node's utilization is within 10% (default) of the read-only threshold (nodeRebalanceTriggerThresholdPercentage).

     b. There is a destination node with enough capacity to receive the migrated volumes - at least 30% (default) free capacity below the read-only threshold (nodeRebalanceTargetThreshold).

  2. Cluster capacity imbalance (some nodes with very high utilization alongside nodes with very low utilization):

     a. There is a node in the cluster whose utilization is under 20% (default) of the total node capacity (nodeUnderUtilizeThreshold).

     b. The capacity difference between the two farthest nodes (the most utilized and the least utilized) is more than 30% (default) (nodeUtilizationDifferenceThreshold).
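The first trigger can be sketched as a simple threshold check. The read-only limit and the node utilization figures below are illustrative assumptions; only the 10% and 30% default margins come from the text:

```bash
# Sketch of trigger condition 1, using the default margins from the text.
# All capacity figures are hypothetical utilization percentages.
read_only_threshold=80   # illustrative read-only limit (%)
trigger_margin=10        # nodeRebalanceTriggerThresholdPercentage (default 10%)
target_free_margin=30    # nodeRebalanceTargetThreshold (default 30%)

source_util=75           # candidate source node utilization (%)
dest_util=40             # candidate destination node utilization (%)

# Trigger when the source is within 10% of read-only AND a destination has
# at least 30% free capacity below the read-only threshold.
if [ "$source_util" -ge $((read_only_threshold - trigger_margin)) ] &&
   [ "$dest_util" -le $((read_only_threshold - target_free_margin)) ]; then
  echo "trigger proactive rebalance"
else
  echo "no rebalance needed"
fi
```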

Proactive Rebalance CLI Example:

Proactive rebalance mode is activated or deactivated by enabling or disabling its feature flag.

For example, to enable/disable a cluster feature flag with a given feature flag name:

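As with fail in place, a sketch with lbcli; the subcommand form and the feature flag name are assumptions that may differ between Lightbits releases, so verify them against your release's CLI reference:

```bash
# Hypothetical lbcli invocations - the subcommand syntax and the flag name
# "proactive-rebalance" are assumptions; check `lbcli --help` on your cluster.
lbcli enable feature-flag proactive-rebalance    # turn proactive rebalance on (the default)
lbcli disable feature-flag proactive-rebalance   # turn proactive rebalance off (not recommended)
```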