Issues Fixed in Lightbits 3.15.3

ID     Description
44298  If a journal NVMe device fails while the node manager is down, the node manager can try to take another disk for journaling when it comes back up. If the matchers for journaling are the same as those for data devices, it can take one of the data devices by mistake, causing the GFTL to crash. Journaling must use specific matchers that are different from those of the data devices (for example, the serial number).
43456  In a rare combination of conditions, a storage node could fail to start after a restart if placement group membership changed while the node was offline. This requires memory pressure during a prior recovery (causing stale metadata to be persisted), followed by placement group rebalancing that removes volumes from the node. Normal operations and graceful recovery flows are not affected.
42963  When SSD Journaling is enabled, if Duroslight fails due to a non-journal-related issue, the Node Manager (NM) might incorrectly classify the failure as a journal device failure, causing the NM to remain inactive and enter a Permanently Failed state. When SSD Journaling is disabled, a Duroslight failure does not impact the failure scenario. However, users might receive a spurious "journal device failed" event even when journaling is not in use. This is cosmetic only and does not reflect an actual journal issue.
42800  A volume's protection state could fail to update correctly in some cases of network/etcd unavailability.
42614  Cluster manager and etcd services could suffer a potential, very slow memory leak in rare cases. Mishandling of a deprecated GFTL data loss event could cause the event cleanup logic to stop removing old events, leading to a continuous increase in the number of events stored on the cluster.
42309  In certain situations, if a CM failover occurs during the initial KEK rotation process (a race condition), the new CM might not be able to become active. As a result, many APIs will fail. The data path will still work as long as all nodes are healthy.
42282  A volume's protection state might be reported incorrectly in the API as fully protected instead of degraded or read-only, following a permanent failure rebalance that fails. The issue is limited to the protection state that the API reports; internally, the protection state is handled as expected.
41162  Deleting a snapshot while a node is inactive could cause a subsequent rebuild initiated from that node (acting as primary) to fail. This condition can occur when the inactive node retains metadata for the deleted snapshot while peer nodes do not. Full (migration) rebuilds are more likely to be impacted, as they could include objects associated with the affected snapshot. If this issue is encountered, contact Lightbits Support for an approved procedure to identify and release the problematic snapshots.
41068  A node could crash when powering up from an abrupt failure in the rare case where the volume containing the most recently written data is deleted just before an NVMe device failure, and the system completes the full rebuild before any new writes are issued to any volume replicated on that node. If this occurs, the remediation is to either fail the node in place or contact Lightbits Support, who can perform an internal procedure to recover the node from this state.
40871  Instances with seven or more devices and a specific configuration (eight SWLF cores and eight or more recovery cores) could fail graceful recovery and fall back to abrupt recovery, which can take significantly longer. Mitigation: Set the module parameter gracefulrecovery maxrecovery cores="6" in the gftl-options file.
40626  Under an extremely rare race condition that can occur during background garbage collection while two successive snapshots are deleted, it is possible for data from an older volume snapshot to overwrite more recent data. A permanent fix for this issue is in development and will be included in a forthcoming release.
40208  A volume rebuild could fail to complete following an internal error in the handling of a new snapshot creation. When a specific portion of the handling of a create-snapshot task occurs exactly as the cluster manager service is switched over, this volume and other volumes that share the same protection group could get into an inconsistent state that prevents the completion of a volume rebuild.
39742  In certain scenarios, a volume's protection state might fail to update correctly due to an internal race condition that causes a brief resource inconsistency, which in turn fails the protection state update.
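For issue 40871 above, the affected configuration can be expressed as a simple predicate. The following is a minimal sketch, not a Lightbits tool: the function name and its three parameters are hypothetical, and the caller is assumed to already know the node's device count and core assignments. It only encodes the condition as stated in the description (seven or more devices, eight SWLF cores, eight or more recovery cores).

```python
def is_affected_by_40871(num_devices: int, swlf_cores: int, recovery_cores: int) -> bool:
    """Return True if a node matches the configuration described for issue 40871.

    Hypothetical helper: the name and parameters are illustrative assumptions,
    not part of any Lightbits API or CLI.
    """
    return (
        num_devices >= 7          # seven or more devices
        and swlf_cores == 8       # exactly eight SWLF cores
        and recovery_cores >= 8   # eight or more recovery cores
    )


if __name__ == "__main__":
    # Example inventory: (node name, devices, SWLF cores, recovery cores).
    nodes = [
        ("node-a", 8, 8, 8),
        ("node-b", 6, 8, 8),
        ("node-c", 12, 8, 6),
    ]
    for name, devices, swlf, recovery in nodes:
        status = "affected" if is_affected_by_40871(devices, swlf, recovery) else "not affected"
        print(f"{name}: {status}")
```

A check like this could help decide which nodes on earlier releases need the mitigation described in the issue; the mitigation itself is applied in the gftl-options file as stated above.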