Known Issues in Lightbits 3.9.6

ID	Description
39211	If the most recent snapshot of a volume is deleted while a node holding a replica is offline, recently written data could revert to the data stored in that snapshot if the node later becomes the primary.
39168	When using DCPMM, if a snapshot is taken after an abrupt failure, recently written data could be reverted to the state captured in the snapshot.
38754	The node-manager service fails to shut down gracefully if the shutdown is issued before the service has finished powering up.
38497	When creating a new server to replace another server in the cluster (using the --extend-cluster=false flag, which is also the default), the new server will not participate in the proper distribution of replicas across the cluster, which could cause a resource imbalance.
38496	Creating a new server to replace another server in the cluster (using the --extend-cluster=false flag, which is also the default) while dynamic rebalance is enabled could cause the new server to participate in the dynamic rebalance process. The nodes on the server could then move automatically to active instead of remaining unattached, which prevents the server from acting as a replacement in the replace node process.
38043	If encryption was turned on but enabling it failed, resulting in the creation of an 'EnableServerEncryptionFailed' event, the API service will return stale events. Any event in the system from that point onward will not be returned by the "ListEvents" API. As a workaround, check whether this event exists before upgrading to 3.14/3.15.1. Note that a similar issue could also occur when a cluster has a double disk failure on one of the servers (or a single disk failure with no EC), and Lightbits 3.2.x or older was in use at the time of the failure.
37505	The volume statistic 'physicalOwnedCapacity' might report an incorrect value when data is overwritten at the same LBA block with a different length. This can occur when the overwritten data is compressed with a different compression ratio than the original. In such a case, the length of the overwritten data is not accounted for in the statistic.
37395	In rare race conditions, a server may remain stuck in a deleting state.
37205	Incorrect handling of IO errors from NVMe SSDs during abrupt recovery may cause node recovery to fail.
36882	The GFTL service could fail locally due to a rare race condition when an SSD failure/removal, an SSD read submission, and multiple volume rebuilds all occur at exactly the same time.
36722	Users can reference the NVMe device by its path name (e.g., /dev/nvme0n1) - as used during the initial system setup - to determine the storage SSD used by servers in the Lightbits storage cluster. However, this could lead to data loss since device names are not persistent across reboots.
36282	When running the getHost API through REST, in certain cases getHost might return 'not found' even if the host does exist. It is recommended to use the listHosts API with the hostNQN filter to get the proper results. Two initiators connecting with the same hostNQN override each other in listHosts, so only one of them is shown. Removing a hostNQN does not remove it from the subsystem, which means that on the first volume to connect with ALLOW_ANY, old hostNQNs of the initiator will also be listed.
36090	Due to a rare internal error involving long network disconnections, nodes might lose service and stay in the Inactive state even though they should be active.
36089	In rare situations involving stress on the cluster that includes rebalance activity accompanied by disconnections from etcd, the node manager may crash and restart, or fail to complete rebuilds, and volumes may be stuck in the Migrating state. As a workaround, restart the affected node manager.
35837	On a single-instance service on a machine with multiple NUMA nodes with memory, memory stress can occur, and the kernel will try to perform memory reclamation. This leads to start failures in the duroslight service, with the node staying inactive.
35575	Volumes could remain in a degraded state after the node has recovered from network issues.
34169	Duroslight crashes (segfault) during startup on Sapphire Rapids when the kernel is in lockdown mode.
28027	A server's upgrade status will not update in the following sequence: (1) A server is upgraded to release x.y.z. (2) The operation fails (i.e., times out); however, the binaries on the server are updated to version x.y.z. (3) At a later time, the upgrade to version x.y.z is attempted again (this operation is skipped internally, as the binaries have already been updated). The upgrade status will continue to show the failed upgrade operation, even though the last upgrade returned with no error.
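
Regarding issue 36722: because names such as /dev/nvme0n1 can change between reboots, it can help to record a persistent identifier for each SSD alongside its current path. The following is a minimal sketch, not a Lightbits tool. It assumes a Linux host where udev populates /dev/disk/by-id with symlinks, and the function names (map_persistent_ids, ids_for_device) and the "nvme-" prefix filter are illustrative choices only.

```python
import os
from pathlib import Path


def map_persistent_ids(by_id_dir="/dev/disk/by-id", prefix="nvme-"):
    """Return {persistent_id: resolved_device_path} for symlinks in by_id_dir.

    Assumption: by_id_dir contains udev-style symlinks whose names stay
    stable across reboots, while their targets (e.g. /dev/nvme0n1) may not.
    """
    base = Path(by_id_dir)
    mapping = {}
    if not base.is_dir():
        # Directory is absent (non-Linux host or no udev): nothing to map.
        return mapping
    for entry in sorted(base.iterdir()):
        if entry.is_symlink() and entry.name.startswith(prefix):
            mapping[entry.name] = os.path.realpath(entry)
    return mapping


def ids_for_device(device_path, mapping):
    """Return the persistent IDs that currently resolve to device_path."""
    target = os.path.realpath(device_path)
    return [name for name, dev in mapping.items() if dev == target]


if __name__ == "__main__":
    # Print the current ID-to-path mapping so it can be compared after a reboot.
    for name, dev in map_persistent_ids().items():
        print(f"{name} -> {dev}")
```

Comparing the printed mapping before and after a reboot shows whether a given path still refers to the same physical SSD.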
