Known Issues in Lightbits 3.9.6

ID Description
39211 When deleting the most recent snapshot of a volume while a node holding a replica is offline, recently written data could revert to the data stored in that snapshot if the node later becomes the primary.
39168 When using DCPMM, if a snapshot is taken after an abrupt failure, recently written data could be reverted to the state captured in the snapshot.
38754 A node-manager service will fail to shut down gracefully if the shutdown is issued before the service has finished powering up.
38497 When creating a new server to replace another server in the cluster (using the --extend-cluster=false flag - note that this is also the default), the new server will not participate in the proper distribution of replicas across the cluster, which could cause an imbalance of resources.
38496 Creating a new server to replace another server in the cluster (using the --extend-cluster=false flag - note that this is also the default) while dynamic rebalance is enabled could cause the new server to participate in the dynamic rebalance process. The nodes on the server could then move automatically to active instead of remaining unattached. This prevents the server from acting as a replacement in the replace node process.
38043

If encryption was turned on but enabling it failed - resulting in the creation of an 'EnableServerEncryptionFailed' event - the API service will return stale events. Any event from that point onward that exists in the system will not be returned by the "ListEvents" API.

As a workaround, check if this event exists before upgrading to 3.14/3.15.1.

Note that a similar issue could also occur when a cluster has a double disk failure on one of the servers (or a single disk failure with no EC), and Lightbits 3.2.x or older was in use at the time of the failure.
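The workaround above amounts to searching the existing events for 'EnableServerEncryptionFailed'. The following is a minimal Python sketch, assuming the events have already been exported as JSON (for example, from a ListEvents response). The field names in the usage note are hypothetical, so the search walks every string value instead of relying on a particular schema.

```python
import json
from typing import Any

TARGET_EVENT = "EnableServerEncryptionFailed"


def contains_event(node: Any, name: str = TARGET_EVENT) -> bool:
    """Return True if `name` appears in any string value of a decoded JSON tree."""
    if isinstance(node, str):
        return name in node
    if isinstance(node, dict):
        return any(contains_event(value, name) for value in node.values())
    if isinstance(node, list):
        return any(contains_event(item, name) for item in node)
    return False  # numbers, booleans, and null cannot carry the event name


def events_text_has_failure(events_json: str) -> bool:
    """Parse exported events (a JSON string) and look for the failure event."""
    return contains_event(json.loads(events_json))
```

For instance, `events_text_has_failure('[{"eventType": "EnableServerEncryptionFailed"}]')` returns True; the `eventType` key is an illustrative assumption, not a documented field.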

37505 The volume statistic 'physicalOwnedCapacity' might report an incorrect value when data is overwritten at the same LBA block with a different length. This can occur when the overwritten data is compressed with a different compression ratio than the original. In such a case, the length of the overwritten data is not accounted for in the statistic.
37395 Under some rare race conditions, a server may remain stuck in a deleting state.
37205 Incorrect handling of IO errors from NVMe SSDs during abrupt recovery may cause node recovery to fail.
36882 The GFTL service could fail locally due to a rare race condition when an SSD failure/removal, an SSD read submission, and multiple volume rebuilds all occur at exactly the same time.
36722 Users can reference an NVMe device by its path name (e.g., /dev/nvme0n1) - as used during the initial system setup - to determine the storage SSD used by servers in the Lightbits storage cluster. However, this could lead to data loss, since device names are not persistent across reboots.
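On Linux, udev maintains symlinks under /dev/disk/by-id that are derived from drive identity and not from enumeration order, so they can help confirm which physical SSD a name such as /dev/nvme0n1 currently refers to. The sketch below is illustrative only: the function name is hypothetical, and whether Lightbits configuration accepts such paths is not stated here.

```python
import os
from typing import List


def persistent_aliases(device: str, by_id_dir: str = "/dev/disk/by-id") -> List[str]:
    """Return the symlinks in `by_id_dir` that currently resolve to `device`."""
    if not os.path.isdir(by_id_dir):
        return []
    target = os.path.realpath(device)
    aliases = []
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        # Follow the symlink and compare it with the resolved device path.
        if os.path.realpath(link) == target:
            aliases.append(link)
    return aliases
```

Calling `persistent_aliases("/dev/nvme0n1")` before and after a reboot shows whether the same drive is still behind that name.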
36282
  1. When running the getHost API through REST, in certain cases getHost might return 'not found' even if the host does exist. It is recommended to use the listHosts API with the hostNQN filter to get the proper results.
  2. Two initiators connecting with the same hostNQN override each other in listHosts, so only one of them is shown.
  3. Removing a hostNQN does not remove it from the subsystem, which means that on the first volume to connect with ALLOW_ANY, old hostNQNs of the initiator will also be listed.
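The recommendation in item 1 can be sketched as a lookup that falls back to listHosts when getHost reports 'not found'. This is only an illustration of the pattern: `HostNotFound`, `lookup_host`, the two injected callables, and the `hostNQN` dictionary key are all hypothetical stand-ins for whatever REST client is in use, not part of the Lightbits API.

```python
from typing import Callable, Dict, List, Optional

Host = Dict[str, str]


class HostNotFound(Exception):
    """Hypothetical error a client raises when getHost returns 'not found'."""


def lookup_host(
    host_nqn: str,
    get_host: Callable[[str], Host],
    list_hosts: Callable[[str], List[Host]],
) -> Optional[Host]:
    """Try getHost first; on 'not found', retry through listHosts filtered by hostNQN."""
    try:
        return get_host(host_nqn)
    except HostNotFound:
        # Re-check the NQN client-side in case the filter returns extra entries.
        matches = [h for h in list_hosts(host_nqn) if h.get("hostNQN") == host_nqn]
        return matches[0] if matches else None
```

Because of item 2, a single match does not guarantee that only one initiator is using that hostNQN.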
36090 Due to a rare internal error involving long network disconnections, nodes might lose service and stay in the Inactive state - even though they should be active.
36089 In rare situations involving stress on the cluster that includes rebalance activity accompanied by disconnections from etcd, the node manager may crash and restart, or fail to complete rebuilds, and volumes may be stuck in the Migrating state. If this happens, a workaround is to restart the affected node manager.
35837 When a single-instance service runs on a machine with multiple NUMA nodes that have memory, memory stress can occur and the kernel will try to reclaim memory. This leads to start failures in the duroslight service, with the node staying inactive.
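To check whether a machine matches the condition in 35837 (more than one NUMA node with memory), the per-node meminfo files that Linux exposes under /sys/devices/system/node can be inspected. This is a minimal sketch; the function name and the `base` parameter are illustrative choices, not part of any Lightbits tooling.

```python
import glob
import os
import re
from typing import List

MEMTOTAL = re.compile(r"Node\s+\d+\s+MemTotal:\s+(\d+)\s+kB")


def numa_nodes_with_memory(base: str = "/sys/devices/system/node") -> List[int]:
    """Return the IDs of NUMA nodes whose MemTotal is greater than zero."""
    nodes = []
    for path in glob.glob(os.path.join(base, "node[0-9]*", "meminfo")):
        # The directory is named node<N>; strip the "node" prefix to get N.
        node_id = int(os.path.basename(os.path.dirname(path))[len("node"):])
        with open(path) as handle:
            for line in handle:
                match = MEMTOTAL.match(line)
                if match:
                    if int(match.group(1)) > 0:
                        nodes.append(node_id)
                    break
    return sorted(nodes)
```

If `len(numa_nodes_with_memory())` is greater than 1 on a single-instance server, the machine falls into the configuration described in this issue.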
35575 Volumes could remain in a degraded state after the node has recovered from network issues.
34169 Duroslight crashes (segfault) during startup on Sapphire Rapids when the kernel is in lockdown mode.
28027

A server's upgrade status will not update in the following sequence:

  1. A server is upgraded to release x.y.z.
  2. The operation fails (i.e., times out); however, binaries on the server are updated to version x.y.z.
  3. At a later time, the upgrade is attempted again to version x.y.z (this operation is skipped internally, as binaries have already been updated).
  4. The upgrade status will continue to show the failed upgrade operation, even though the last upgrade returned with no error.