Known Issues in Lightbits 3.13.1
| ID | Description |
|---|---|
| 44673 | In rare cases, performing a KEK rotation while simultaneously executing a high volume of volume/snapshot control plane operations could result in increased etcd contention. This affects only customers utilizing cluster-level encryption with the KEK rotation API. |
| 44383 | When encryption is enabled, deleting a server that hosts the Active CM could prevent subsequent server additions from completing successfully. To avoid this, stop the CM service on the removed server (or shut down the server) after it has been removed from the cluster. |
| 42614 | In rare cases, the cluster manager and etcd services could experience a very slow memory leak. Mishandling of a deprecated GFTL data loss event could cause the event cleanup logic to stop removing old events, leading to a continuous increase in the number of events stored on the cluster. |
| 41466 | Creating a snapshot with a retention time greater than 192 years will fail and cause the API service to restart. |
| 41162 | Deleting a snapshot while a node is inactive could cause a subsequent rebuild initiated from that node (acting as primary) to fail. This condition can occur when the inactive node retains metadata for the deleted snapshot while peer nodes do not. Full (migration) rebuilds are more likely to be impacted, as they could include objects associated with the affected snapshot. If this issue is encountered, contact Lightbits Support for an approved procedure to identify and release the problematic snapshots. |
| 41068 | A node could crash when powering up after an abrupt failure in the rare case where the volume containing the most recently written data is deleted just before an NVMe device failure, and the system then completes the full rebuild before any new writes are issued to any volume replicated on that node. If this occurs, the remediation is to either fail the node in place or contact Lightbits Support, who can perform an internal procedure to recover the node from this state. |
| 40068 | In rare cases, a newly-created volume could be assigned the same NSID as an existing volume. This condition can lead to incorrect delete or update operations for volumes sharing the same NSID. If this issue is encountered, contact Lightbits Support for a manual remediation procedure to identify and fix the affected volumes. |
| 39628 | To prevent a rare potential Machine Check Exception (MCE) and forced reboots on Sapphire Rapids machines, we recommend disabling the DSA offload feature. This condition can occur if the duroslight log indicates "Enabling DSA crc32 offload for reads," and can be prevented by adding dsa_read_crc32: false and dsa_write_crc32: false under the "configurator" section of /etc/duroslight/conf.yaml. |
| 39211 | When deleting the most recent snapshot of a volume while a node holding a replica is offline, recently written data could revert to the data stored in that snapshot if the node later becomes the primary. |
| 38754 | The node-manager service will fail to shut down gracefully if the shutdown is issued before the service has finished powering up. |
| 38497 | When a new server is created to replace another server in the cluster using the --extend-cluster=false flag (the default setting), and at a much later time this server and its node experience a permanent failure with fail-in-place enabled (causing the server's resources to be migrated), the server might not participate fully in the distribution of replicas across the cluster if it becomes active again. This could cause an imbalance of resources. |
| 37831 | In some cases, silent data corruption on an SSD could cause a node crash instead of attempting to recover the data and reporting an event. This can occur if the SSD returns invalid data rather than an I/O error. |
| 37830 | In extremely rare cases, a node may not recover to an active state if an I/O error or bad block is encountered on an underlying SSD during its startup sequence. This prevents a key service (gftl) from initializing correctly and may require manual intervention (such as removing the failed SSD from the system) to allow the node to complete its recovery successfully. |
| 37505 | In a rare combination of events, the 'physicalOwnedCapacity' volume statistic may report an incorrect value when data at a specific LBA is overwritten with content that has a different compression ratio. In this scenario, the updated length of the overwritten data is not correctly reflected in the statistic. |
| 37395 | In rare race conditions, a server may remain stuck in a deleting state. |
| 37205 | Incorrect handling of IO errors from NVMe SSDs during abrupt recovery may cause node recovery to fail. |
| 37114 | In releases 3.13.1 and 3.14.1, the node-manager could fail to start, causing a Lightbits node to fail to come up. This can occur when the following conditions are all true: 1. A node is configured such that Lightbits will use zero SSD drives in NUMA 1. 2. Lightbits is configured to use one or more SSD drives in NUMA 2. 3. Lightbits is configured to use a single instance across multiple NUMA nodes. 4. The allowCrossNuma flag is set to false. If the node-manager fails to start due to the above, contact Lightbits Support. |
| 36882 | The GFTL service could fail locally due to a rare race condition when an SSD failure/removal, an SSD read submission, and multiple volume rebuilds all occur at exactly the same time. |
| 36722 | Device matchers do not support specifying an exact NVMe device by path name, as this attribute is not persistent across reboots. Using it could lead to data loss, since the wrong device may be used. |
| 36515 | Due to changes in the OpenSSL version used in the Lightbits front end, there is a degradation of ~30% in read IO throughput when the data is encrypted in version 3.13.1. |
| 36282 | |
| 33865 | In certain cases when migrating volumes during dynamic rebalancing, a VolumeInDegradedProtectionState event could be sent out when the volume is actually fully protected. |
| 29683 | Systems with Solidigm/Intel D5-P5316 drives may experience higher than expected write latency after several drive write cycles. Contact Lightbits Support if you use Solidigm/Intel D5-P5316 SSDs and are experiencing higher than expected write latency. |
| 28027 | In specific circumstances, the server upgrade status may not reflect the correct state after a successful retry. This occurs in the following sequence: |
| 25382 | Under the conditions below, the amount of storage occupied by cold units (filled with 4096 small objects) is not accounted for and not reported. This could result in a storage full or almost full situation that is not observable in the node storage statistics: - A sufficient amount of logical user storage contains highly compressible data; e.g., zeroes. - This data has been written in large chunks over a short period of time. - During this time, there are no or almost no user writes with lower compression rates or to uncompressed volumes. - The highly compressed data remains unmodified (cold); i.e., not overwritten by user writes for a long period of time. When such a situation occurs, the control plane software does not detect that storage capacity has reached the threshold for starting proactive rebalancing to free capacity. The System Administrator relies on the same storage statistics that the control plane exposes, and therefore cannot tell that the system capacity has reached the limit. |
| 22582 | A server could remain in "Enabling" state if the enable server command is issued during an upgrade. |
| 19670 | The compression ratio returned by get-cluster API will be incorrect when the cluster has snapshots created over volumes. The calculation of the compression ratio at the cluster level uses different logic for physical used capacity and the amount of uncompressed data written to storage. Hence the compression ratio value might be higher than the actual value. A correct indication of cluster level compression can be deduced from a weighted average of compression ratio at the node levels; i.e., Compression ratio = sum(node compression ratio * node physical usage) / sum(node physical usage). |
| 18966 | "lbcli list events" could fail with "received message larger than max" when there are events that contain a large amount of information. Workaround: Use the --limit and --since arguments to read a smaller amount of data at a time. |
| 18948 | The node local rebuild progress (due to SSD failure) shows 100% done when there is no storage space left to complete the rebuild. |
| 18522 | When attempting to add a server to a cluster using lbcli 'create server' or REST POST '/api/v2/servers', and the operation fails for any reason, 'list servers' could permanently show the new server in 'creating' state. |
| 18214 | Automatic rebalancing features (fail-in-place and proactive-rebalance) should be disabled if enable_iptables is enabled during installation. |
| 17298 | The migration of volumes due to automatic rebalancing could take time, even when volumes are empty. |
| 15715 | During a volume rebuild, the Grafana dashboard does not show the write IOs for the recovered data. |
| 14995 | A single server cluster cannot be upgraded using the cluster upgrade command. Upgrade using only the upgrade server command. |
| 14889 | In case of an SSD failure, the system will scan the storage and rebuild the data. The entire raw capacity will be scanned, even when not all of it was utilized. This leads to a longer rebuild time than necessary. |
| 14863 | Prior to lb CSI installation, the lb discovery client service must be installed and started on all K8S cluster nodes. |
| 14212 | OpenStack: Once a volume attach fails, the following attempts to attach it will also fail. Workaround: Remove the discovery-client configuration files for the failed volume and restart the discovery-client and Nova services. |
| 13680 | In a cluster deployed with a minimum of two replicas, when more than one node fails, a three-replica volume may remain in read-only mode after its rebuild completes if another node returns to the active state at the same time. |
| 13253 | A local rebuild takes the same amount of time regardless of storage utilization. |
| 13064 | Following a 'replace node' operation, volumes with a single replica will be created as 'unavailable' in the new node. Note: Single replica volumes are not protected, and data will not move to the new node. Workaround: Delete single replica volumes before replacing the node, or reboot the new server after replacing the node. |
| 12310 | After a volume becomes unavailable due to failure of all replicas, more than one replica might need to recover before the volume becomes available again. |
| 11856 | Volume and node usage metrics might show different values between REST/lbcli and Prometheus, when a volume is deleted and a node is disconnected. |
| 11326 | Volume metrics do not return any value for volumes that are created but do not store any data. |
| 10021 | Commands affecting SSD content (such as blkdiscard, nvme format) should not be executed on the Lightbits server. |
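
For issue 19670, the weighted average described in the table can be computed from per-node values. The following is a minimal sketch, assuming the per-node compression ratio and physical usage have already been collected from the node statistics. The function name and the `(ratio, physical_usage)` tuple layout are hypothetical and are not part of lbcli or the Lightbits API.

```python
def cluster_compression_ratio(nodes):
    """Weighted average of per-node compression ratios.

    `nodes` is a list of (compression_ratio, physical_usage) pairs,
    one pair per node. This layout is an assumption for illustration.
    Implements: sum(ratio * usage) / sum(usage).
    """
    total_usage = sum(usage for _, usage in nodes)
    if total_usage == 0:
        # No physical usage reported, so the ratio is undefined.
        return None
    weighted_sum = sum(ratio * usage for ratio, usage in nodes)
    return weighted_sum / total_usage


# Example with made-up values: two nodes with different ratios and usage.
example_nodes = [(2.0, 100.0), (4.0, 300.0)]
print(cluster_compression_ratio(example_nodes))  # 3.5
```

Nodes with more physical usage contribute proportionally more to the result, so a lightly used node with an unusually high ratio does not skew the cluster-level value.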