Lightbits storage handles server failures with volume replication and the “Asymmetric Namespace Access” (ANA) mechanism defined in the NVMe over Fabrics standard.
With this mechanism, the client has multiple paths to the different volume replicas residing on the storage servers. When the Lightbits storage server holding the primary replica fails (or its network is disconnected), the client automatically switches the primary path to a new path and continues the IO. Typically, the path switchover takes less than 10 seconds (you should expect a brief IO hiccup, but it recovers quickly). Volumes that have a replica residing on the failed server change to “Degraded” status.
Once the failed server rejoins the cluster, a “Volume Rebuild” starts on the impacted volumes; the data is synced to the impacted replicas and the volumes recover to “Healthy” status.
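The automatic path switchover relies on the client using the Linux kernel's native NVMe multipath support. As a quick sanity check before running the test (an optional addition, not part of the original procedure), you can confirm that multipathing is enabled in the nvme_core module on the client:

```bash
# Should print Y; native NVMe multipath is required for ANA-based failover.
cat /sys/module/nvme_core/parameters/multipath

# If it prints N, multipath can usually be enabled with the kernel parameter
# nvme_core.multipath=Y (requires a reboot), depending on the distribution.
```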
Test Purpose
The purpose of this test is to prove that this feature works as expected using manual commands. A Linux command is used to power off the Lightbits server holding the primary replica, and the test checks whether IO recovers quickly. It also verifies that the “volume rebuild” happens as expected after the failed server comes back.
Test Steps
1. Use the lbcli command to create one volume and bind it to the client server. For more information, see Creating a Volume on the Lightbits Storage Server.
```
# Create volume
root@lightos-server00:~ lbcli create volume --project-name=default --name=ana-test-vol1 --replica-count=3 --size=96GB --acl="acl3"
Name           UUID                                  State     Protection State  NSID  Size    Replicas  Compression  ACL            Rebuild Progress
ana-test-vol1  f1d50790-959f-40ed-988f-c4144a07f815  Creating  Unknown           0     89 GiB  3         false        values:"acl3"

# Check volume
root@lightos-server00:~ lbcli get volume --project-name=default --name=ana-test-vol1
Name           UUID                                  State      Protection State  NSID  Size    Replicas  Compression  ACL            Rebuild Progress
ana-test-vol1  f1d50790-959f-40ed-988f-c4144a07f815  Available  FullyProtected    8     89 GiB  3         false        values:"acl3"  None
```
2. In Client A, use the “nvme list” command to check the multi-path of the newly created volume.
```
# List devices
root@client-a:~ nvme list
Node          SN                Model              Namespace  Usage                Format       FW Rev
------------  ----------------  -----------------  ---------  -------------------  -----------  ------
/dev/nvme0n1  b4649e6a3496d720  Lightbits LightOS  8          96.00 GB / 96.00 GB  4 KiB + 0 B  2.3

# Show device controllers
root@client-a:~ nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
```
The “live optimized” path leads to the server holding the primary replica, while the “live inaccessible” paths lead to the servers holding the secondary replicas. This volume has three replicas in total: one primary path and two secondary paths.
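If the client is not yet connected to the Lightbits subsystem, the paths can typically be established with the standard NVMe/TCP connect command. The sketch below is not part of the original procedure; it reuses the subsystem NQN and server addresses from the example output above, and your environment may instead rely on discovery or a connection manager.

```bash
# Connect to each Lightbits storage server over NVMe/TCP (one controller per path).
# NQN and addresses are copied from the example output above; replace with your own.
SUBSYS_NQN="nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1"
for ADDR in 10.20.130.10 10.20.130.11 10.20.130.12; do
    nvme connect --transport=tcp --traddr="$ADDR" --trsvcid=4420 --nqn="$SUBSYS_NQN"
done

# Depending on the volume ACL (acl3 in this example), --hostnqn may also need to
# be passed so that the client matches the ACL.
nvme list-subsys /dev/nvme0n1
```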
3. Use FIO to generate a continuous IO load to this volume, as in the example below.
```
# Check FIO profile
root@client-a:~ cat rand_rw_70-30.fio
# FIO profile - Random 70/30:
[global]
runtime=6000
ramp_time=0
rw=randrw                   # READ/WRITE mix
rwmixread=70                # 70% reads / 30% writes
refill_buffers
loops=1
buffer_compress_percentage=50
buffer_compress_chunk=4096
direct=1
norandommap=1
time_based
cpus_allowed_policy=split
log_avg_msec=1000
numjobs=8                   # Number of CPU cores
cpus_allowed=0-7            # CPU IDs (0 to numjobs-1)
iodepth=12
randrepeat=0
ioengine=libaio
group_reporting=1
bs=4k

[job2]
filename=/dev/nvme0n1       # Device path

# Run FIO job
root@client-a:~ fio rand_rw_70-30.fio
job2: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=12
...
fio-3.7
Starting 8 processes
Jobs: 8 (f=8): [m(8)][0.1%][r=522 MiB/s,w=224 MiB/s][r=134k,w=57.3k IOPS][eta 01h:39m:54s]
```
4. Power off the Lightbits storage server holding the primary replica to simulate a server failure. Monitor the FIO status and check whether it returns to normal after a short period of time (approximately 10 seconds). Use “nvme list-subsys /dev/nvme0n1” to check the optimized path switchover.
```
job2: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=12
...
fio-3.7
Starting 8 processes
Jobs: 8 (f=8): [m(8)][4.7%][r=576 MiB/s,w=246 MiB/s][r=147k,w=62.0k IOPS][eta 01h:35m:21s]

# Show device controllers
root@client-a:~ nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 connecting optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live optimized
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
```
Expect the “live optimized” path to switch to a storage server that is still working, while the original primary path changes to “connecting optimized”.
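To watch the switchover while the server goes down, it can help to poll the path states continuously. The loop below is a small optional helper, not part of the original test procedure:

```bash
# Poll the ANA path states once per second; stop with Ctrl-C.
# The "optimized" path should move to a surviving server within roughly 10 seconds.
while true; do
    date
    nvme list-subsys /dev/nvme0n1
    sleep 1
done
```

The client kernel log (dmesg) typically also records the lost controller's reconnect attempts during the outage.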
5. Use “lbcli list nodes” to view the cluster status of the remaining storage servers, and use “lbcli list volumes” to view the status of the affected volume.
```
# List nodes; one should be Inactive
root@lightos-server01:~ lbcli list nodes
Name        UUID                                  State     NVMe endpoint      Failure domains  Local rebuild progress
server01-0  8544b302-5118-5a3c-bec8-61b224089654  Active    10.20.130.11:4420  [server01]       None
server02-0  859bd46d-abe8-54fa-81c4-9683f8705b65  Active    10.20.130.12:4420  [server02]       None
server00-0  8630e6a8-eae9-595a-9380-7974666e9a8a  Inactive  10.20.130.10:4420  [server00]       None

# List volumes; the test volume should show as Degraded
root@lightos-server01:~ lbcli list volumes
Name           UUID                                  State      Protection State  NSID  Size    Replicas  Compression  ACL            Rebuild Progress
ana-test-vol1  f1d50790-959f-40ed-988f-c4144a07f815  Available  Degraded          8     89 GiB  3         false        values:"acl3"  None
```
The shut-down server becomes “Inactive” and the volume created for this test becomes “Degraded”. Other volumes may be impacted as well, depending on where their replicas reside.
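If other volumes share replicas with the failed server, a quick way to see everything that is currently degraded is to filter the volume list. This is a simple sketch that assumes the plain-text output format shown above:

```bash
# List every volume whose protection state is currently Degraded.
lbcli list volumes | grep -i degraded
```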
6. Power up the server again. It will rejoin the cluster automatically. Use “lbcli list nodes” to view the cluster status, and use “lbcli get volume” to monitor the rebuild progress of the impacted volume. The volume should eventually return to “FullyProtected”.
```
# List nodes; all should be Active
root@lightos-server01:~ lbcli list nodes
Name        UUID                                  State   NVMe endpoint      Failure domains  Local rebuild progress
server01-0  8544b302-5118-5a3c-bec8-61b224089654  Active  10.20.130.11:4420  [server01]       None
server02-0  859bd46d-abe8-54fa-81c4-9683f8705b65  Active  10.20.130.12:4420  [server02]       None
server00-0  8630e6a8-eae9-595a-9380-7974666e9a8a  Active  10.20.130.10:4420  [server00]       None

# Get volume and check rebuild
root@lightos-server01:~ lbcli get volume --name=ana-test-vol1 --project-name=default
Name           UUID                                  State      Protection State  NSID  Size    Replicas  Compression  ACL            Rebuild Progress
ana-test-vol1  f1d50790-959f-40ed-988f-c4144a07f815  Available  Degraded          8     89 GiB  3         false        values:"acl3"  18

# Keep running the command until "Rebuild Progress" shows "None"
root@lightos-server01:~ lbcli get volume --name=ana-test-vol1 --project-name=default
Name           UUID                                  State      Protection State  NSID  Size    Replicas  Compression  ACL            Rebuild Progress
ana-test-vol1  f1d50790-959f-40ed-988f-c4144a07f815  Available  Degraded          8     89 GiB  3         false        values:"acl3"  None
```
A small polling helper that automates this check is sketched after the note below.

Note that the server shutdown time should be less than “DurationToTurnIntoPermanentFailure” if “fail-in-place” is enabled for dynamic rebalancing. The default is one hour. If the shutdown exceeds this time, the server failure is treated as a permanent failure, and the impacted replicas are rebalanced to other working nodes.
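Re-running the “lbcli get volume” command by hand can be tedious during a long rebuild. The loop below is an optional convenience sketch that polls until the volume reports FullyProtected again; it assumes the plain-text column layout shown above, so adjust the parsing if your lbcli output differs:

```bash
# Poll the test volume until its Protection State returns to FullyProtected.
while true; do
    STATUS=$(lbcli get volume --name=ana-test-vol1 --project-name=default | tail -n 1)
    echo "$(date +%T)  ${STATUS}"
    echo "${STATUS}" | grep -q "FullyProtected" && break
    sleep 10
done
echo "Rebuild complete: volume is FullyProtected again."
```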