Dynamic Rebalancing on Server Failure
Dynamic rebalancing on server failure makes the storage cluster self-healing. The cluster recovers volumes that a server failure has left in the “Degraded” state by dynamically moving volume replicas from the failed node to other healthy nodes, while preserving failure-domain requirements.
The feature can be enabled or disabled through the “fail-in-place” feature flag (enabled by default). The time from the node failure until the cluster starts recovering volumes is determined by a cluster configuration parameter called “DurationToTurnIntoPermanentFailure”.
Test Purpose
The purpose of this test is to prove that the feature works as expected: the impacted volumes “self-heal” and recover to the “FullyProtected” state after a certain period of time (how long depends on the value of --parameter=DurationToTurnIntoPermanentFailure).
Test Steps
- In the Lightbits storage cluster, check the “fail-in-place” feature flag and the “DurationToTurnIntoPermanentFailure” parameter. To shorten the test waiting time, the parameter can be changed to a smaller duration.
# Check feature flag
root@lightos-server00:~ lbcli get feature-flag fail-in-place
Feature Flag   Enabled
FailInPlace    true
[root@lightos-server00 ~] lbcli get cluster-config --parameter=DurationToTurnIntoPermanentFailure
Cluster Config Parameter              Value
DurationToTurnIntoPermanentFailure    1h0m0s
[root@lightos-server00 ~] lbcli update cluster-config --parameter=DurationToTurnIntoPermanentFailure --value=2h
[root@lightos-server00 ~] lbcli get cluster-config --parameter=DurationToTurnIntoPermanentFailure
Cluster Config Parameter              Value
DurationToTurnIntoPermanentFailure    2h1m0s
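If the parameter is changed for the test, it is worth recording the original value first so it can be restored afterwards. A minimal bash sketch using the same lbcli commands as above; the variable name is illustrative and the awk filter assumes the two-column output format shown:

# Save the current setting before changing it (run on a cluster node)
ORIG_DURATION=$(lbcli get cluster-config --parameter=DurationToTurnIntoPermanentFailure \
    | awk '/^DurationToTurnIntoPermanentFailure/ {print $2}')

# ... run the test ...

# Restore the original setting after the test
lbcli update cluster-config --parameter=DurationToTurnIntoPermanentFailure --value="${ORIG_DURATION}"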
- Create a few two-replica volumes (note that a test with three-replica volumes requires at least four storage servers).
[root@lightos-server00 ~] lbcli create volume --project-name=default --name=rebalance-test-vol1 --replica-count=2 --size=200GB --acl=nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c
Name                 UUID                                  State     ProtectionState  NSID  Size     Replicas  Compression  ACL                                                                             Rebuild Progress
rebalance-test-vol1  7c840113-ec79-42b6-af6b-de3b1a5676a1  Creating  Unknown          0     186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"
[root@lightos-server00 ~] lbcli create volume --project-name=default --name=rebalance-test-vol2 --replica-count=2 --size=200GB --acl=nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c
Name                 UUID                                  State     ProtectionState  NSID  Size     Replicas  Compression  ACL                                                                             Rebuild Progress
rebalance-test-vol2  c46f0217-c813-49ff-9c17-f66b958ca349  Creating  Unknown          0     186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"
[root@lightos-server00 ~] lbcli create volume --project-name=default --name=rebalance-test-vol3 --replica-count=2 --size=200GB --acl=nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c
Name                 UUID                                  State     ProtectionState  NSID  Size     Replicas  Compression  ACL                                                                             Rebuild Progress
rebalance-test-vol3  5ccb62dc-017c-4b73-bfbf-fbad715b1175  Creating  Unknown          0     186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"
[root@lightos-server00 ~] lbcli list volumes | grep rebalance-test | sort
rebalance-test-vol1  7c840113-ec79-42b6-af6b-de3b1a5676a1  Available  FullyProtected  16  186 GiB  2  false  values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol2  c46f0217-c813-49ff-9c17-f66b958ca349  Available  FullyProtected  17  186 GiB  2  false  values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol3  5ccb62dc-017c-4b73-bfbf-fbad715b1175  Available  FullyProtected  18  186 GiB  2  false  values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
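Volume creation can also be scripted. A minimal bash sketch, assuming the same flags and ACL NQN used above; the volume count and name prefix are arbitrary:

# Create three 2-replica test volumes with the ACL used in this test
ACL="nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"
for i in 1 2 3; do
    lbcli create volume --project-name=default --name="rebalance-test-vol${i}" \
        --replica-count=2 --size=200GB --acl="${ACL}"
done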
- On the client side, check these volumes and their multipath information, then use FIO to generate IO traffic to them.
[root@client-a ~] nvme list
Node          SN                Model              Namespace  Usage                  Format       FW Rev
------------- ----------------- ------------------ ---------- ---------------------- ------------ --------
/dev/nvme0n1  b4649e6a3496d720  Lightbits LightOS  16         200.00 GB / 200.00 GB  4 KiB + 0 B  2.3
/dev/nvme0n2  b4649e6a3496d720  Lightbits LightOS  17         200.00 GB / 200.00 GB  4 KiB + 0 B  2.3
/dev/nvme0n3  b4649e6a3496d720  Lightbits LightOS  18         200.00 GB / 200.00 GB  4 KiB + 0 B  2.3
[root@client-a ~] nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live
[root@client-a ~] nvme list-subsys /dev/nvme0n2
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
[root@client-a ~] nvme list-subsys /dev/nvme0n3
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live
[root@client-a ~] fio --ioengine=libaio --bs=4096k --filename=/dev/nvme0n1:/dev/nvme0n2:/dev/nvme0n3 --direct=1 --thread --rw=rw --runtime=3600 --name='bs 4096KB' --numjobs=8
bs 4096KB: (g=0): rw=rw, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 8 threads
Jobs: 8 (f=24): [M(8)][0.6%][r=1233MiB/s,w=1449MiB/s][r=308,w=362 IOPS][eta 28m:43s]
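Before starting FIO it can be useful to confirm that every namespace exposes all of its expected paths. A minimal sketch that counts the “live” entries reported by nvme list-subsys; it assumes the text output format shown above and the device names used in this test:

# Count live controller paths for each test namespace
for dev in /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3; do
    live=$(nvme list-subsys "${dev}" | grep -c ' live')
    echo "${dev}: ${live} live path(s)"
done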
- Shut down one server to simulate a server failure, then check all the node states of the cluster, as well as the volume states, from a healthy node.
[root@lightos-server01 ~] ip a | grep 10.20.130.11
    inet 10.20.130.11/24 brd 10.20.130.255 scope global noprefixroute enp59s0f0
[root@lightos-server01 ~] halt -p
Remote side unexpectedly closed network connection
[root@lightos-server00 ~] lbcli list nodes | sort
Name        UUID                                  State     NVMe endpoint      Failure domains  Local rebuild progress
server00-0  8630e6a8-eae9-595a-9380-7974666e9a8a  Active    10.20.130.10:4420  [server00]       None
server01-0  8544b302-5118-5a3c-bec8-61b224089654  Inactive  10.20.130.11:4420  [server01]       None
server02-0  859bd46d-abe8-54fa-81c4-9683f8705b65  Active    10.20.130.12:4420  [server02]       None
[root@lightos-server00 ~] lbcli list volumes
Name                 UUID                                  State      ProtectionState  NSID  Size     Replicas  Compression  ACL                                                                             Rebuild Progress
rebalance-test-vol1  7c840113-ec79-42b6-af6b-de3b1a5676a1  Available  Degraded         16    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol2  c46f0217-c813-49ff-9c17-f66b958ca349  Available  FullyProtected   17    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol3  5ccb62dc-017c-4b73-bfbf-fbad715b1175  Available  Degraded         18    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
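The failed node and the impacted volumes can also be pulled out quickly with simple filters; a minimal sketch based on the listings above:

# Show the inactive node(s) and the volumes left in Degraded state
lbcli list nodes   | grep -i inactive
lbcli list volumes | grep -i degraded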
- Wait for a period of time (how long depends on the value of --parameter=DurationToTurnIntoPermanentFailure). The total time taken is DurationToTurnIntoPermanentFailure plus the replica rebalancing time, which depends on the volumes’ used physical capacity. The volumes should then be in the “FullyProtected” state again.
[root@lightos-server00 ~] lbcli list volumes | sort
Name                 UUID                                  State      ProtectionState  NSID  Size     Replicas  Compression  ACL                                                                             Rebuild Progress
rebalance-test-vol1  7c840113-ec79-42b6-af6b-de3b1a5676a1  Available  FullyProtected   16    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol2  c46f0217-c813-49ff-9c17-f66b958ca349  Available  FullyProtected   17    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
rebalance-test-vol3  5ccb62dc-017c-4b73-bfbf-fbad715b1175  Available  FullyProtected   18    186 GiB  2         false        values:"nqn.2014-08.org.nvmexpress:uuid:a878c393-29ec-494f-bba2-098628dc436c"  None
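Rather than rechecking manually, the wait can be scripted. A minimal bash sketch that polls the volume list until none of the test volumes reports Degraded any more; the 60-second interval and the volume name filter are arbitrary choices:

# Poll until no rebalance-test volume reports a Degraded protection state
while lbcli list volumes | grep rebalance-test | grep -q Degraded; do
    echo "$(date): volumes still Degraded, waiting..."
    sleep 60
done
echo "No rebalance-test volume is Degraded any more"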
- On the client side, use “nvme list-subsys” to check the volume path information. The impacted volumes should now have a new path.
[root@client-a ~] nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 connecting inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
[root@client-a ~] nvme list-subsys /dev/nvme0n2
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 connecting
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
[root@client-a ~] nvme list-subsys /dev/nvme0n3
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 connecting inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
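To double-check which targets now serve each namespace, the live paths can be listed per device; a minimal sketch that again assumes the output format shown above:

# List the live controller paths per namespace; the failed server (10.20.130.11)
# should only show up as "connecting" at this point
for dev in /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3; do
    echo "${dev}:"
    nvme list-subsys "${dev}" | grep ' live'
done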