Known Issues in Lightbits 3.17.1

ID	Description
44673	In rare cases, performing a KEK rotation while simultaneously executing a high volume of volume/snapshot control plane operations could result in increased etcd contention. This affects only customers utilizing cluster-level encryption with the KEK rotation API.
44435	pmem_init and lbe executables are linked using staticx and therefore require /tmp to be mounted with the exec option enabled; otherwise, these executables will fail to run with "permission denied".
44383	When encryption is enabled, deleting a server that hosts the Active CM could prevent subsequent server additions from completing successfully. To avoid this, stop the CM service on the removed server (or shut down the server) after it has been removed from the cluster.
44298	If a journal NVMe device fails while the node manager is down, the node manager can try to claim another disk for journaling when it comes back up. If the journaling matchers are the same as the data device matchers, it can claim one of the data devices by mistake, causing the GFTL to crash. For journaling, use specific matchers that differ from those of the data devices (for example, the serial number).
42991	A network disconnect that coincides exactly with a state change of an NVMe SSD device could prevent future state changes of that specific device from being updated correctly. The issue resolves itself the next time the node-manager service is restarted.
42963	When SSD Journaling is enabled, if Duroslight fails due to a non-journal-related issue, the Node Manager (NM) might incorrectly classify the failure as a journal device failure, causing the NM to remain inactive and enter a Permanently Failed state. When SSD Journaling is disabled, a Duroslight failure does not impact the failure scenario. However, users might receive a spurious "journal device failed" event even when journaling is not in use. This is cosmetic only and does not reflect an actual journal issue.
42614	In rare cases, the cluster manager and etcd services could suffer a very slow memory leak. Mishandling of a deprecated GFTL data loss event could cause the event cleanup logic to stop removing old events, leading to a continuous increase in the number of events stored on the cluster.
42309	In rare cases, a CM failover occurring during the initial KEK rotation process may result in a race condition where the new CM fails to become active, causing some API calls to fail. The data path remains unaffected as long as all nodes are healthy.
42282	A volume's protection state might be reported incorrectly in the API as fully protected instead of degraded or read-only after a permanent-failure rebalance fails. The issue is limited to the protection state that the API reports; internally, the protection state is handled as expected.
41873	Under a specific race condition, if a snapshot is created while a node is down and subsequently deleted during a very precise window in the node's startup sequence, the node may become unavailable.
41466	Creating a snapshot with a retention time greater than 192 years will fail and cause the API service to restart.
41162	Deleting a snapshot while a node is inactive could cause a subsequent rebuild initiated from that node (acting as primary) to fail. This condition can occur when the inactive node retains metadata for the deleted snapshot while peer nodes do not. Full (migration) rebuilds are more likely to be impacted, as they could include objects associated with the affected snapshot. If this issue is encountered, contact Lightbits Support for an approved procedure to identify and release the problematic snapshots.
41095	The NodeRebuildNotPossible alert may not trigger under conditions where it should, resulting in missed notifications for rebuild-blocking scenarios.
41068	A node could crash when powering up from an abrupt failure in the rare case where the volume containing the most recently written data is deleted just before an NVMe device failure, and the system completes the full rebuild before any new writes are issued to any volume replicated on that node. If this occurs, the remediation is to either fail the node in place or contact Lightbits Support, who can perform an internal procedure to recover the node from this state.
40883	When using VCP to upgrade a cluster to Lightbits 3.17.1 or later, the upgrade will fail as VCP cannot parse the updated version format. To complete the upgrade successfully, use the Lightbits core CLI or REST API directly.
40607	In a specific edge case, if Duroslight fails to write to the Journal device during a rebuild, Duroslight might crash without producing a Journal SSD Failed event. In such cases, only a NodeInactive event may be recorded.
40428	In extremely rare cases, the reported logical size of a volume could be incorrect after a discard operation is performed with TRIM support enabled.
40293	When an admin-endpoint is deleted or updated, the corresponding iptables rules created for it remain in place. As a result, the related ports stay open even though the admin-endpoint has been deleted or updated. The iptables configuration is refreshed only after a service restart, instead of being properly updated in real time.
40068	In rare cases, a newly-created volume could be assigned the same NSID as an existing volume. This condition can lead to incorrect delete or update operations for volumes sharing the same NSID. If this issue is encountered, contact Lightbits Support for a manual remediation procedure to identify and fix the affected volumes.
39951	A temporary issue - such as a brief network glitch occurring during a specific short window in the node power-up process - could prevent the node from completing the power-up successfully. If this issue occurs, contact Lightbits Support for assistance.
39742	In rare circumstances, a volume's protection state may fail to update correctly. An internal race condition can cause a brief resource inconsistency that makes the protection state update fail.
39184	When TRIM is enabled and a user performs a discard operation, the reported logical size might be incorrect and not reflect the true logical size.
38706	In some rare cases, Duroslight could hang during shutdown.
38497	This issue affects a server that was created to replace another server in the cluster using the --extend-cluster=false flag (the default setting). If, at a much later time, this server and its node experience a permanent failure while fail in place is enabled (causing the server's resources to be migrated), and the server then becomes active again, it might not participate fully in the distribution of replicas across the cluster, which could cause a resource imbalance.
37830	In extremely rare cases, a node may not recover to an active state if an I/O error or bad block is encountered on an underlying SSD during its startup sequence. This prevents a key service (gftl) from initializing correctly and may require manual intervention (such as removing the failed SSD from the system) to allow the node to complete its recovery successfully.
37505	In a rare combination of events, the 'physicalOwnedCapacity' volume statistic may report an incorrect value when data at a specific LBA is overwritten with content that has a different compression ratio. In this scenario, the updated length of the overwritten data is not correctly reflected in the statistic.
28027	In specific circumstances, the server upgrade status may not reflect the correct state after a successful retry. This occurs in the following sequence: 1. A server is upgraded to version x.y.z. 2. The operation fails (for example, due to a timeout); however, the server binaries are successfully updated to version x.y.z. 3. The upgrade is retried to version x.y.z — this step is skipped internally, as the binaries are already up to date. 4. The upgrade status continues to show the previous failed operation, even though the retry completed without error.
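
For issue 44435, a host can be checked in advance for a /tmp mount that carries the noexec option. The following is a minimal sketch, not a Lightbits tool: it assumes a Linux host that exposes its mount table at /proc/mounts, and the function names mount_options_for and tmp_allows_exec are hypothetical names chosen for this example.

```python
def mount_options_for(path, mounts_text):
    """Return (mount_point, options) for the mount that contains `path`.

    `mounts_text` is text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    best_point, best_opts = None, set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        point = fields[1].replace("\\040", " ")  # spaces are octal-escaped
        opts = set(fields[3].split(","))
        prefix = point.rstrip("/") + "/"
        if path == point or path.startswith(prefix):
            # Prefer the deepest mount point; on a tie, the later entry
            # wins because it was mounted on top of the earlier one.
            if best_point is None or len(point) >= len(best_point):
                best_point, best_opts = point, opts
    return best_point, best_opts


def tmp_allows_exec(mounts_text, path="/tmp"):
    """Return True if the mount that holds `path` lacks the noexec option."""
    _, opts = mount_options_for(path, mounts_text)
    return "noexec" not in opts


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:  # assumption: Linux host
            table = f.read()
    except OSError:
        print("mount table not available on this host")
    else:
        point, _ = mount_options_for("/tmp", table)
        state = "exec allowed" if tmp_allows_exec(table) else "noexec set"
        print(f"/tmp is served by mount {point}: {state}")
```

If the check reports noexec, /tmp needs to be remounted with exec enabled before pmem_init and lbe can run.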

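For issue 41466, a client can reject an oversized snapshot retention time before the request reaches the API service. This is a sketch under stated assumptions: the source gives only the 192-year figure, so treating a year as 365 days is an assumption, and validate_retention is a hypothetical helper, not part of any Lightbits client.

```python
from datetime import timedelta

# Assumption: a year is counted as 365 days; the source states only "192 years".
MAX_RETENTION = timedelta(days=192 * 365)


def validate_retention(retention):
    """Return `retention` unchanged, or raise ValueError if it is out of range."""
    if retention <= timedelta(0):
        raise ValueError("retention time must be positive")
    if retention > MAX_RETENTION:
        raise ValueError("retention time exceeds 192 years")
    return retention
```

Calling validate_retention(timedelta(days=30)) returns the value unchanged, while a value such as timedelta(days=200 * 365) raises ValueError on the client side instead of being sent to the cluster.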