Known Issues in Lightbits 3.9.1

ID	Description
40068	In rare cases, a newly-created volume could be assigned the same NSID as an existing volume. This condition can lead to incorrect delete or update operations for volumes sharing the same NSID. If this issue is encountered, contact Lightbits Support for a manual remediation procedure to identify and fix the affected volumes.
39211	When the most recent snapshot of a volume is deleted while a node holding a replica is offline, recently written data could revert to the data stored in that snapshot if the node later becomes the primary.
38043	If encryption was turned on but enabling it failed, resulting in the creation of an 'EnableServerEncryptionFailed' event, the API service returns stale events: any event created from that point onward exists in the system but is not returned by the "ListEvents" API. As a workaround, check whether this event exists before upgrading to 3.14/3.15.1. Note that a similar issue could also occur when a cluster had a double disk failure on one of the servers (or a single disk failure with no EC) while Lightbits 3.2.x or older was in use.
37831	In some cases, silent data corruption on an SSD could cause a node to crash instead of attempting to recover the data and reporting an event. This can occur if the SSD returns invalid data rather than an I/O error.
37395	In rare race conditions, a server may remain stuck in a deleting state.
37205	Incorrect handling of IO errors from NVMe SSDs during abrupt recovery may cause node recovery to fail.
36722	Users can reference an NVMe device by its path name (e.g., /dev/nvme0n1), as used during the initial system setup, to determine the storage SSDs used by servers in the Lightbits storage cluster. However, this could lead to data loss since device names are not persistent across reboots.
36090	Due to a rare internal error involving long network disconnections, nodes might lose service and stay in the Inactive state even though they should be active.
36089	In rare situations involving cluster stress that includes rebalance activity together with disconnections from etcd, the node manager may crash and restart, or fail to complete rebuilds, and volumes may be stuck in the Migrating state. Workaround: Restart the affected node manager.
35837	When a single-instance service runs on a machine with multiple NUMA nodes that have memory, memory stress can occur and the kernel will try to perform memory reclamation. This leads to start failures in the duroslight service, with the node staying inactive.
35734	Due to a sorting algorithm misconfiguration, graceful powerup can take significantly longer than it should. In some cases, disabling the use of DCPMM for MD can serve as a workaround for this issue.
35575	Volumes could remain in a degraded state after the node has recovered from network issues.
34975	The control plane cannot perform management operations on a volume whose deletion was attempted. If the deletion attempt was being handled while a CM switch occurred, the new CM issues messages indicating that the state of the volume is invalid, and is unable to recover or clean up the volume.
34970	api-service could become unresponsive if it loses connectivity to etcd during its startup. Workaround: Restart api-service using systemctl restart api-service.
34169	Duroslight crashes (segfault) during startup on Sapphire Rapids, when the kernel is in lockdown mode.
33998	A node could fail on the GFTL assertion objects_info_digest != write_unit_md_objects_info_digest when both of the following conditions hold: (1) At some earlier point in time, a GFTL abrupt recovery occurred. (2) Before this recovery, a snapshot was deleted while more recent snapshots of the same volume and/or clones from this snapshot were present. Note that when using K8s or OpenStack orchestration, Lightbits CSI/Cinder drivers could create and delete snapshots as part of cloning a volume.
33865	In certain cases when migrating volumes during dynamic rebalancing, a VolumeInDegradedProtectionState event could be sent out when the volume is actually fully protected.
29683	Systems with Solidigm/Intel D5-P5316 drives may experience higher than expected write latency after several drive write cycles. Contact Lightbits Support if you use Solidigm/Intel D5-P5316 SSDs and are experiencing higher than expected write latency.
28027	A server's upgrade status will not update in the following sequence: (1) A server is upgraded to release x.y.z. (2) The operation fails (i.e., times out); however, binaries on the server are updated to version x.y.z. (3) At a later time, the upgrade to version x.y.z is attempted again (this operation is skipped internally, as the binaries have already been updated). The upgrade status continues to show the failed upgrade operation, even though the last upgrade returned with no error.
25382	Under the conditions below, the storage occupied by cold units (filled with 4096 small objects) is not accounted for or reported, which could lead to a storage-full or almost-full situation that is not observable in the node storage statistics: (1) A sufficient amount of logical user storage contains highly compressible data (e.g., zeroes). (2) This data was written in large chunks over a short period of time. (3) During this time, there were no, or almost no, user writes with lower compression rates or to uncompressed volumes. (4) The highly compressed data remains unmodified (cold); i.e., it is not overwritten by user writes for a long period of time. When this occurs, the control plane software does not detect that storage capacity has reached the threshold for starting proactive rebalancing to free capacity. The system administrator relies on the same storage statistics that the control plane exposes, and therefore cannot tell that the system capacity has reached the limit.
22582	A server could remain in "Enabling" state if the enable server command is issued during an upgrade.
19670	The compression ratio returned by get-cluster API will be incorrect when the cluster has snapshots created over volumes. The calculation of the compression ratio at the cluster level uses different logic for physical used capacity and the amount of uncompressed data written to storage. Hence the compression ratio value might be higher than the actual value. A correct indication of cluster level compression can be deduced from a weighted average of compression ratio at the node levels; i.e., Compression ratio = sum(node compression ratio * node physical usage) / sum(node physical usage).
18966	"lbcli list events" could fail with "received message larger than max" when there are events that contain a large amount of information. Workaround: Use the --limit and --since arguments to read a smaller amount of data at a time.
18948	The node local rebuild progress (due to SSD failure) shows 100% done when there is no storage space left to complete the rebuild.
18522	When attempting to add a server to a cluster using lbcli 'create server' or REST POST '/api/v2/servers', and the operation fails for any reason, 'list servers' could permanently show the new server in the 'creating' state.
18214	Automatic rebalancing features (fail-in-place and proactive-rebalance) should be disabled if enable_iptables is enabled during installation.
17329	Lightbits exposes latency information per request size. The time window for latency measurement is not synchronized with the measurement of the number of read/write requests. Therefore, a weighted average calculation of latency over all request sizes will result in inaccurate latency information.
17298	The migration of volumes due to automatic rebalancing could take time, even when volumes are empty.
15715	During a volume rebuild, the Grafana dashboard does not show the write IOs for the recovered data.
15037	With the IP Tables feature enabled, adding a new node requires opening the etcd ports for that node using the "lbcli create admin-endpoint" command.
14995	A single server cluster cannot be upgraded using the API. In order to upgrade, manually log into the server, stop the Lightbits services, run a yum update, and reboot.
14889	In case of an SSD failure, the system will scan the storage and rebuild the data. The entire raw capacity will be scanned, even when not all of it was utilized. This leads to a longer rebuild time than necessary.
14863	Prior to lb CSI installation, the lb discovery client service must be installed and started on all K8S cluster nodes.
14787	The Lightbits installation will fail on systems with NVDIMMs that do not support auto labels. Workaround: Log into the server and issue the following command: ndctl create-namespace -f -e namespace0.0 --type=pmem --mode=dax --no-autolabel
14212	OpenStack: Once a volume attach fails, the following attempts to attach it will also fail. Workaround: Remove the discovery-client configuration files for the failed volume and restart the discovery-client and Nova services.
13680	In a cluster deployed with a minimum of two replicas, when more than one node fails, a three-replica volume may stay in read-only mode after its rebuild completes if another node returns to the active state at the same time.
13253	A local rebuild takes the same amount of time regardless of storage utilization.
13064	Following a 'replace node' operation, volumes with a single replica will be created as 'unavailable' in the new node. Note: Single replica volumes are not protected, and data will not move to the new node. Workaround: Delete single replica volumes before replacing the node, or reboot the new server after replacing the node.
12310	After a volume becomes unavailable due to failure of all replicas, more than one replica might need to recover before the volume becomes available again.
11856	Volume and node usage metrics might show different values between REST/lbcli and Prometheus, when a volume is deleted and a node is disconnected.
11326	Volume metrics do not return any value for volumes that are created but do not store any data.
10021	Commands affecting SSD content (such as blkdiscard, nvme format) should not be executed on the Lightbits server.
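
For issue 19670, the weighted average given in the description can be sketched as follows. This is a minimal illustration only: the function name and the input shape (pairs of per-node compression ratio and physical usage) are assumptions, and the values would have to be collected from the per-node statistics by whatever means your deployment provides.

```python
from typing import Iterable, Tuple


def cluster_compression_ratio(nodes: Iterable[Tuple[float, float]]) -> float:
    """Weighted average of node compression ratios.

    Each item is a hypothetical (node_compression_ratio, node_physical_usage)
    pair. Implements the formula from issue 19670:
    sum(ratio * physical_usage) / sum(physical_usage).
    """
    pairs = list(nodes)
    total_physical = sum(usage for _, usage in pairs)
    if total_physical == 0:
        # No physical usage reported, so the ratio is undefined.
        raise ValueError("total physical usage is zero")
    return sum(ratio * usage for ratio, usage in pairs) / total_physical


if __name__ == "__main__":
    # Two nodes: ratio 2.0 with 100 units used, ratio 4.0 with 300 units used.
    print(cluster_compression_ratio([(2.0, 100), (4.0, 300)]))
```

Nodes with more physical usage contribute proportionally more to the result, which is why a plain mean of the node ratios would not match.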
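
For issue 36722, one way to avoid depending on a non-persistent name such as /dev/nvme0n1 is to look up the persistent aliases that udev typically maintains under /dev/disk/by-id on Linux. The sketch below is not a Lightbits tool; the function name is hypothetical, and it assumes the by-id directory exists on the server.

```python
import os
from typing import List


def persistent_names(device: str, by_id_dir: str = "/dev/disk/by-id") -> List[str]:
    """Return the symlinks in by_id_dir that resolve to the given device.

    Assumption: by_id_dir holds udev-style symlinks pointing at device nodes.
    """
    target = os.path.realpath(device)
    matches = []
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        # Compare fully resolved paths so relative symlinks are handled.
        if os.path.realpath(link) == target:
            matches.append(link)
    return matches


if __name__ == "__main__":
    # Prints the persistent aliases of the example device, if any exist.
    if os.path.isdir("/dev/disk/by-id"):
        print(persistent_names("/dev/nvme0n1"))
```

Recording one of the returned aliases, rather than the kernel-assigned path, keeps the reference stable if device enumeration order changes after a reboot.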
