Server Failure Handling
Lightbits storage handles server failure with volume replication and an “Asymmetric Namespace Access” (ANA) mechanism defined in the NVMe over Fabrics standard.
With this mechanism, the client has multiple paths to the different replicas of a volume residing on the storage servers. When the storage server holding the primary replica fails (or its network is disconnected), the client automatically switches to a new primary path and continues the IO. The switchover typically takes less than 10 seconds (expect a brief IO hiccup, followed by a quick recovery). Volumes with a replica on the failed server enter “Degraded” status.
Once the failed server rejoins the cluster, a “Volume Rebuild” starts on the impacted volumes; the data is synced to the affected replicas and the volumes recover to “Healthy” status.
Test Purpose
The purpose of this test is to verify that this feature works as expected, using manual commands. A Linux command will be used to power off the Lightbits server holding the primary replica, and we will check whether IO recovers quickly. We will also verify that the “volume rebuild” occurs as expected after the failed server rejoins the cluster.
Test Steps
- Use the lbcli command to create one volume and bind it to the client server. For more, see Creating a Volume on the Lightbits Storage Server.
# Create volume
root@lightos-server00:~ lbcli create volume --project-name=default --name=ana-test-vol1 --replica-count=3 --size=96GB --acl="acl3"
Name UUID State Protection State NSID Size Replicas Compression ACL Rebuild Progress
ana-test-vol1 f1d50790-959f-40ed-988f-c4144a07f815 Creating Unknown 0 89 GiB 3 false values:"acl3"
# Check volume
root@lightos-server00:~ lbcli get volume --project-name=default --name=ana-test-vol1
Name UUID State Protection State NSID Size Replicas Compression ACL Rebuild Progress
ana-test-vol1 f1d50790-959f-40ed-988f-c4144a07f815 Available FullyProtected 8 89 GiB 3 false values:"acl3" None
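Note that the volume is requested as `--size=96GB` (decimal gigabytes) but listed as 89 GiB (binary gibibytes); both figures describe the same capacity. The conversion can be checked directly:

```shell
# 96 GB (decimal, 96e9 bytes) expressed in GiB (binary, 1024^3 bytes):
awk 'BEGIN { printf "%.1f GiB\n", 96e9 / (1024^3) }'
# Prints: 89.4 GiB  (lbcli truncates this to 89 GiB)
```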
- In Client A, use the “nvme list” command to check the multipath state of the newly created volume.
# List devices
root@client-a:~ nvme list
Node             SN               Model              Namespace Usage                  Format       FW Rev
---------------- ---------------- ------------------ --------- ---------------------- ------------ --------
/dev/nvme0n1     b4649e6a3496d720 Lightbits LightOS  8         96.00 GB /  96.00 GB   4 KiB + 0 B  2.3
# Show device controllers
root@client-a:~ nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
The “live optimized” path leads to the server holding the primary replica, while the “live inaccessible” paths lead to servers holding secondary replicas. This volume has three replicas in total - one primary path and two secondary paths.
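The current optimized path can also be extracted programmatically. A minimal sketch, run here against a captured copy of the `nvme list-subsys` output so it is self-contained; on the client you would pipe the live command output in instead:

```shell
# Captured `nvme list-subsys /dev/nvme0n1` output (from the step above):
sample='nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live inaccessible
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible'

# Keep only the "live optimized" line and print its traddr value.
printf '%s\n' "$sample" \
  | awk '/live optimized/ { for (i=1;i<=NF;i++) if ($i ~ /^traddr=/) { sub("traddr=","",$i); print $i } }'
# Prints: 10.20.130.10
```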
- Use FIO to generate a continuous IO load on this volume, as in the example below.
# Check FIO profile
root@client-a:~ cat rand_rw_70-30.fio
#FIO profile - Random 70/30:
[global]
runtime=6000
ramp_time=0
rw=randrw # READ/WRITE
rwmixread=70 # 70% reads / 30% writes
refill_buffers
loops=1
buffer_compress_percentage=50
buffer_compress_chunk=4096
direct=1
norandommap=1
time_based
cpus_allowed_policy=split
log_avg_msec=1000
numjobs=8 # Number of CPU cores
cpus_allowed=0-7 # CPU indices (0 to numjobs-1)
iodepth=12
randrepeat=0
ioengine=libaio
group_reporting=1
bs=4k
[job2]
filename=/dev/nvme0n1 # Device path
# Run FIO job
root@client-a:~ fio rand_rw_70-30.fio
job2: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=12
...
fio-3.7
Starting 8 processes
Jobs: 8 (f=8): [m(8)][0.1%][r=522MiB/s,w=224MiB/s][r=134k,w=57.3k IOPS][eta 01h:39m:54s]
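The IOPS pair in the fio status line can be sanity-checked against the 70/30 mix set by rwmixread in the profile. Using the figures reported above (r=134k, w=57.3k):

```shell
# Check the observed read share against rwmixread=70.
awk 'BEGIN {
  r = 134000; w = 57300               # observed IOPS from the fio status line
  printf "read share: %.1f%%\n", 100 * r / (r + w)
}'
# Prints: read share: 70.0%
```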
- Power off the Lightbits storage server holding the primary replica to simulate a server failure, monitor the FIO status, and check whether IO returns to normal after a short period (approximately 10 seconds). Use “nvme list-subsys /dev/nvme0n1” to check the optimized path switchover.
job2: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=12
...
fio-3.7
Starting 8 processes
Jobs: 8 (f=8): [m(8)][4.7%][r=576MiB/s,w=246MiB/s][r=147k,w=62.0k IOPS][eta 01h:35m:21s]
# Show device controllers
root@client-a:~ nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.2016-01.com.lightbitslabs:uuid:e7ba5876-431b-43ca-a7e5-4a0aaae3d1e1
\
 +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 connecting optimized
 +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live optimized
 +- nvme2 tcp traddr=10.20.130.12 trsvcid=4420 live inaccessible
Expect the “live optimized” path to switch to a storage server that is still working, while the original primary path becomes “connecting optimized”.
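One way to confirm the switchover in a script is to compare the traddr of the “live optimized” line before and after the power-off. A sketch against lines captured from the outputs above; a real check would re-run `nvme list-subsys` for each snapshot:

```shell
# "live optimized" lines captured before and after the simulated failure:
before=' +- nvme0 tcp traddr=10.20.130.10 trsvcid=4420 live optimized'
after=' +- nvme1 tcp traddr=10.20.130.11 trsvcid=4420 live optimized'

# Pull out the traddr token from each snapshot and compare.
old=$(printf '%s\n' "$before" | grep -o 'traddr=[0-9.]*')
new=$(printf '%s\n' "$after"  | grep -o 'traddr=[0-9.]*')
if [ "$old" != "$new" ]; then
  echo "optimized path switched: ${old#traddr=} -> ${new#traddr=}"
fi
# Prints: optimized path switched: 10.20.130.10 -> 10.20.130.11
```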
- Use “lbcli list nodes” to view the cluster status of the remaining storage servers, and use “lbcli list volumes” to view the status of the affected volume.
# List nodes, one should be inactive
root@lightos-server01:~ lbcli list nodes
Name UUID State NVMe endpoint Failure domains Local rebuild progress
server01-0 8544b302-5118-5a3c-bec8-61b224089654 Active 10.20.130.11:4420 [server01] None
server02-0 859bd46d-abe8-54fa-81c4-9683f8705b65 Active 10.20.130.12:4420 [server02] None
server00-0 8630e6a8-eae9-595a-9380-7974666e9a8a Inactive 10.20.130.10:4420 [server00] None
# List volumes, it should show as degraded
root@lightos-server01:~ lbcli list volumes
Name UUID State Protection State NSID Size Replicas Compression ACL Rebuild Progress
ana-test-vol1 f1d50790-959f-40ed-988f-c4144a07f815 Available Degraded 8 89 GiB 3 false values:"acl3" None
The powered-off server becomes “Inactive” and the volume created for this test becomes “Degraded”. Other volumes may be impacted too, depending on where their replicas are located.
- Power up the server again. It will rejoin the cluster automatically. Use “lbcli list nodes” to view the cluster status, and use “lbcli get volume” to monitor the rebuild progress of the impacted volume. It should eventually become “FullyProtected”.
# List nodes, all should be active
root@lightos-server01:~ lbcli list nodes
Name UUID State NVMe endpoint Failure domains Local rebuild progress
server01-0 8544b302-5118-5a3c-bec8-61b224089654 Active 10.20.130.11:4420 [server01] None
server02-0 859bd46d-abe8-54fa-81c4-9683f8705b65 Active 10.20.130.12:4420 [server02] None
server00-0 8630e6a8-eae9-595a-9380-7974666e9a8a Active 10.20.130.10:4420 [server00] None
# Get volume and check rebuild
root@lightos-server01:~ lbcli get volume --name=ana-test-vol1 --project-name=default
Name UUID State Protection State NSID Size Replicas Compression ACL Rebuild Progress
ana-test-vol1 f1d50790-959f-40ed-988f-c4144a07f815 Available Degraded 8 89 GiB 3 false values:"acl3" 18
# Keep running command until "Rebuild Progress" shows "None"
root@lightos-server01:~ lbcli get volume --name=ana-test-vol1 --project-name=default
Name UUID State Protection State NSID Size Replicas Compression ACL Rebuild Progress
ana-test-vol1 f1d50790-959f-40ed-988f-c4144a07f815 Available Degraded 8 89 GiB 3 false values:"acl3" None
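Running the get-volume command by hand can be wrapped in a simple polling loop that exits once the Rebuild Progress column reads “None”. A sketch, with a stub `lbcli` function standing in for a real cluster so it runs standalone; the stub and its three-poll behavior are illustrative only, and dropping the stub runs the loop against the real CLI:

```shell
# Stub for lbcli (illustrative): fakes a rebuild that finishes on the
# third poll. Remove this function to poll a live cluster instead.
lbcli() {
  polls=$((polls + 1))
  if [ "$polls" -lt 3 ]; then
    echo 'ana-test-vol1 ... Available Degraded ... 18'
  else
    echo 'ana-test-vol1 ... Available FullyProtected ... None'
  fi
}

polls=0
# Poll until the Rebuild Progress column (last field) reads "None".
until lbcli get volume --name=ana-test-vol1 --project-name=default | grep -q 'None$'; do
  sleep 0   # use a real interval (e.g. sleep 5) against a live cluster
done
echo "rebuild finished after $polls polls"
# Prints: rebuild finished after 3 polls
```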
- Note that the server shutdown time should be less than “DurationToTurnIntoPermanentFailure” when “fail-in-place” is enabled for dynamic rebalancing; the default is one hour. If the shutdown exceeds this time, the server failure is treated as a permanent failure, and the impacted replicas are rebalanced to the other working nodes.
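For planned maintenance it can be worth checking the expected outage against this window up front. A trivial guard sketch; the 600-second outage value is an assumed example, and 3600 seconds reflects the stated one-hour default:

```shell
# Compare a planned outage against the fail-in-place window.
planned_outage_sec=600        # assumed maintenance window (example value)
permanent_failure_sec=3600    # stated default DurationToTurnIntoPermanentFailure

if [ "$planned_outage_sec" -lt "$permanent_failure_sec" ]; then
  echo "outage fits inside the fail-in-place window; expect an in-place rebuild"
else
  echo "outage exceeds the window; expect a permanent-failure rebalance"
fi
# Prints: outage fits inside the fail-in-place window; expect an in-place rebuild
```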