Dynamic Rebalancing on Server Failure
Dynamic rebalancing on server failure enables a self-healing storage cluster. When a server fails, the cluster recovers the impacted volumes from the "Degraded" state by dynamically moving volume replicas from the failed node to other healthy nodes, while preserving failure domain requirements.
This feature can be enabled or disabled through the "fail-in-place" feature flag (enabled by default). The time from a node failure until the cluster starts recovering volumes is determined by a cluster configuration parameter called "DurationToTurnIntoPermanentFailure".
Test Purpose
The purpose of this test is to verify that this feature works as expected: the impacted volumes "self-heal" and return to the "FullyProtected" state after a certain period of time, which depends on the value of --parameter=DurationToTurnIntoPermanentFailure.
Test Steps
- In the Lightbits storage cluster, check the "fail-in-place" setting and the "DurationToTurnIntoPermanentFailure" parameter. To shorten the test waiting time, the parameter can be updated to a smaller duration. (A scripted pre-flight check is sketched after the output below.)
# Check feature flag
[root@lightos-server00 ~] lbcli get feature-flag fail-in-place
Feature Flag   Enabled
FailInPlace    true
[root@lightos-server00 ~] lbcli get cluster-config --parameter=DurationToTurnIntoPermanentFailure
Cluster Config Parameter              Value
DurationToTurnIntoPermanentFailure    1h0m0s
[root@lightos-server00 ~] lbcli update cluster-config --parameter=DurationToTurnIntoPermanentFailure --value=2h
[root@lightos-server00 ~] lbcli get cluster-config --parameter=DurationToTurnIntoPermanentFailure
Cluster Config Parameter              Value
DurationToTurnIntoPermanentFailure    2h1m0s
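If this pre-check is scripted, a small guard can abort the run when fail-in-place is disabled and then shorten the timer for the test. A minimal sketch, assuming the same lbcli commands shown above and that a duration such as 15m is accepted by --value (an arbitrary example; use whatever waiting time suits the test):

#!/usr/bin/env bash
# Pre-flight check: abort if fail-in-place is disabled, then set a shorter
# permanent-failure timer so the test does not have to wait the full default.
set -euo pipefail

if ! lbcli get feature-flag fail-in-place | grep -q "true"; then
    echo "fail-in-place is disabled; enable it before running this test" >&2
    exit 1
fi

lbcli get cluster-config --parameter=DurationToTurnIntoPermanentFailure
lbcli update cluster-config --parameter=DurationToTurnIntoPermanentFailure --value=15m
lbcli get cluster-config --parameter=DurationToTurnIntoPermanentFailure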
- Create a few two-replica volumes. (Note that a test with three-replica volumes requires at least four storage servers. A scripted variant of the volume creation is sketched after the output below.)
[root@lightos-server00 ~] lbcli create volume --project-name=default --name=rebalance-test-vol1 --replica-count=2 --size=200GB --acl=nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c
Name                 UUID                                  State     Protection State  NSID  Size     Replicas  Compression  ACL
rebalance-test-vol1  7c840113-ec79-42b6-af6b-de3b1a5676a1  Creating  Unknown           0     186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"
[root@lightos-server00 ~] lbcli create volume --project-name=default --name=rebalance-test-vol2 --replica-count=2 --size=200GB --acl=nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c
Name                 UUID                                  State     Protection State  NSID  Size     Replicas  Compression  ACL
rebalance-test-vol2  c46f0217-c813-49ff-9c17-f66b958ca349  Creating  Unknown           0     186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"
[root@lightos-server00 ~] lbcli create volume --project-name=default --name=rebalance-test-vol3 --replica-count=2 --size=200GB --acl=nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c
Name                 UUID                                  State     Protection State  NSID  Size     Replicas  Compression  ACL
rebalance-test-vol3  5ccb62dc-017c-4b73-bfbf-fbad715b1175  Creating  Unknown           0     186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"
[root@lightos-server00 ~] lbcli list volumes | grep rebalance-test | sort
rebalance-test-vol1  7c840113-ec79-42b6-af6b-de3b1a5676a1  Available  FullyProtected  16  186 GiB  2  false  values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol2  c46f0217-c813-49ff-9c17-f66b958ca349  Available  FullyProtected  17  186 GiB  2  false  values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol3  5ccb62dc-017c-4b73-bfbf-fbad715b1175  Available  FullyProtected  18  186 GiB  2  false  values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
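The three create commands above can also be issued in a short loop. A minimal sketch, assuming the same flags and the same client-initiator NQN used in the commands above:

#!/usr/bin/env bash
# Create three two-replica test volumes with the same ACL used above,
# then list them to confirm they reach Available / FullyProtected.
set -euo pipefail

ACL="nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"

for i in 1 2 3; do
    lbcli create volume \
        --project-name=default \
        --name=rebalance-test-vol${i} \
        --replica-count=2 \
        --size=200GB \
        --acl=${ACL}
done

lbcli list volumes | grep rebalance-test | sort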
- On the client side, check these volumes and their multipath information, then use FIO to generate I/O traffic to them. (An equivalent fio job file is sketched after the output below.)
[root@client-a ~] nvme list
Node          SN                Model              Namespace  Usage                  Format       FW Rev
------------- ----------------- ------------------ ---------- ---------------------- ------------ --------
/dev/nvme0n1  b4649e6a3496d720  Lightbits LightOS  16         200.00 GB / 200.00 GB  4 KiB + 0 B  2.3
/dev/nvme0n2  b4649e6a3496d720  Lightbits LightOS  17         200.00 GB / 200.00 GB  4 KiB + 0 B  2.3
/dev/nvme0n3  b4649e6a3496d720  Lightbits LightOS  18         200.00 GB / 200.00 GB  4 KiB + 0 B  2.3
[root@client-a ~] nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live
[root@client-a ~] nvme list-subsys /dev/nvme0n2
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
[root@client-a ~] nvme list-subsys /dev/nvme0n3
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live
[root@client-a ~] fio -ioengine=libaio -bs=4096k -filename=/dev/nvme0n1:/dev/nvme0n2:/dev/nvme0n3 -direct=1 --thread -rw=rw -runtime=3600 -name='bs 4096KB' -numjobs=8
bs 4096KB: (g=0): rw=rw, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 8 threads
Jobs: 8 (f=24): [M(8)][0.6%][r=1233MiB/s,w=1449MiB/s][r=308,w=362 IOPS][eta 28m:43s]
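The same workload can also be kept as a fio job file, which is easier to rerun between test iterations. A minimal sketch equivalent to the command line above (the file name rebalance-test.fio is arbitrary); run it with "fio rebalance-test.fio":

; rebalance-test.fio -- mixed read/write load across the three test namespaces
[global]
ioengine=libaio
direct=1
thread
rw=rw
bs=4096k
runtime=3600
numjobs=8

[bs 4096KB]
filename=/dev/nvme0n1:/dev/nvme0n2:/dev/nvme0n3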
- Shut down one server to simulate a server failure, then check the state of all cluster nodes and the volume states from a healthy node. (A small watch loop for this is sketched after the output below.)
[root@lightos-server01 ~] ip a | grep 10.20.130.11
    inet 10.20.130.11/24 brd 10.20.130.255 scope global noprefixroute enp59s0f0
[root@lightos-server01 ~] halt -p
Remote side unexpectedly closed network connection
[root@lightos-server00 ~] lbcli list nodes | sort
Name        UUID                                  State     NVMe endpoint      Failure domains  Local rebuild progress
server00-0  8630e6a8-eae9-595a-9380-7974666e9a8a  Active    10.20.130.10:4420  [server00]       None
server01-0  8544b302-5118-5a3c-bec8-61b224089654  Inactive  10.20.130.11:4420  [server01]       None
server02-0  859bd46d-abe8-54fa-81c4-9683f8705b65  Active    10.20.130.12:4420  [server02]       None
[root@lightos-server00 ~] lbcli list volumes
Name                 UUID                                  State      Protection State  NSID  Size     Replicas  Compression  ACL                                                                             Rebuild Progress
rebalance-test-vol1  7c840113-ec79-42b6-af6b-de3b1a5676a1  Available  Degraded          16    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol2  c46f0217-c813-49ff-9c17-f66b958ca349  Available  FullyProtected    17    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol3  5ccb62dc-017c-4b73-bfbf-fbad715b1175  Available  Degraded          18    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
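If the failure reaction needs to be observed continuously rather than checked once, a small watch loop can be run from a healthy node. A minimal sketch, assuming the lbcli list output formats shown above:

#!/usr/bin/env bash
# Watch the cluster reaction to the simulated failure every 30 seconds:
# print any Inactive nodes and any test volumes currently reported Degraded.
while true; do
    date
    lbcli list nodes | grep "Inactive"
    lbcli list volumes | grep rebalance-test | grep "Degraded"
    sleep 30
done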
- Wait for a period of time, which depends on the value of --parameter=DurationToTurnIntoPermanentFailure. The total time taken is DurationToTurnIntoPermanentFailure plus the replica rebalancing time, which is related to the volumes' used physical capacity. The volumes should then be in the "FullyProtected" state again. (A polling loop for this wait is sketched after the output below.)
[root@lightos-server00 ~] lbcli list volumes | sort
Name                 UUID                                  State      Protection State  NSID  Size     Replicas  Compression  ACL                                                                             Rebuild Progress
rebalance-test-vol1  7c840113-ec79-42b6-af6b-de3b1a5676a1  Available  FullyProtected    16    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol2  c46f0217-c813-49ff-9c17-f66b958ca349  Available  FullyProtected    17    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol3  5ccb62dc-017c-4b73-bfbf-fbad715b1175  Available  FullyProtected    18    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
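The wait can also be scripted as a polling loop with a timeout. A minimal sketch, assuming the list output above; the 30-second poll interval and the 3-hour ceiling are arbitrary values chosen to exceed DurationToTurnIntoPermanentFailure plus the expected rebuild time:

#!/usr/bin/env bash
# Poll until every rebalance-test volume reports FullyProtected again,
# or give up once the ceiling is reached.
DEADLINE=$((SECONDS + 3*3600))

while (( SECONDS < DEADLINE )); do
    # Count test-volume lines that do not yet contain "FullyProtected".
    remaining=$(lbcli list volumes | grep rebalance-test | grep -vc "FullyProtected")
    if (( remaining == 0 )); then
        echo "all test volumes are FullyProtected again"
        exit 0
    fi
    echo "${remaining} volume(s) still recovering; waiting..."
    sleep 30
done

echo "timed out waiting for volumes to self-heal" >&2
exit 1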
- On the client server side, use "nvme list-subsys" to check the volume path information. The impacted volumes should now show a new path. (A short loop over all three namespaces is sketched after the output below.)
[root@client-a ~] nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 connecting inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
[root@client-a ~] nvme list-subsys /dev/nvme0n2
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 connecting
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
[root@client-a ~] nvme list-subsys /dev/nvme0n3
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 connecting inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
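The per-namespace path check on the client can likewise be scripted. A minimal sketch, iterating over the three namespaces enumerated by nvme list earlier in this test:

#!/usr/bin/env bash
# Print the subsystem path states for each test namespace, so the rebalanced
# path for the impacted volumes is easy to spot after recovery completes.
for dev in /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3; do
    echo "=== ${dev} ==="
    nvme list-subsys "${dev}"
done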