Known Issues in Lightbits 3.13.1

ID Description
44673 In rare cases, performing a KEK rotation while simultaneously executing a large number of volume/snapshot control plane operations could result in increased etcd contention. This affects only customers using cluster-level encryption with the KEK rotation API.
44383 When encryption is enabled, deleting a server that hosts the Active CM could prevent subsequent server additions from completing successfully. To avoid this, stop the CM service on the removed server (or shut down the server) after it has been removed from the cluster.
42614 In rare cases, the cluster manager and etcd services could experience a very slow memory leak. Mishandling of a deprecated GFTL data loss event could cause the event cleanup logic to stop removing old events, leading to a continuous increase in the number of events stored on the cluster.
41466 Creating a snapshot with a retention time greater than 192 years will fail and cause the API service to restart.
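A client-side check can reject oversized retention values before they reach the API. The following is a minimal sketch of such a guard; the function name, the use of `timedelta`, and the 365-day year are assumptions made for illustration and are not part of the Lightbits API.

```python
from datetime import timedelta

# Limit taken from issue 41466. Treating a year as 365 days is an assumption
# that keeps the check slightly conservative.
MAX_RETENTION = timedelta(days=192 * 365)


def check_retention(retention: timedelta) -> timedelta:
    """Return `retention` unchanged, or raise if it exceeds the known limit."""
    if retention > MAX_RETENTION:
        raise ValueError("snapshot retention exceeds 192 years; see issue 41466")
    return retention


# A 30-day retention passes the check and is returned as-is.
print(check_retention(timedelta(days=30)))
```

The guard only prevents the request from being sent; it does not change how the API service handles such a value.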
41162 Deleting a snapshot while a node is inactive could cause a subsequent rebuild initiated from that node (acting as primary) to fail. This condition can occur when the inactive node retains metadata for the deleted snapshot while peer nodes do not. Full (migration) rebuilds are more likely to be impacted, as they could include objects associated with the affected snapshot. If this issue is encountered, contact Lightbits Support for an approved procedure to identify and release the problematic snapshots.
41068 A node could crash when powering up from an abrupt failure in the rare case where the volume containing the most recently written data is deleted just before an NVMe device failure, and the system completes the full rebuild before any new writes are issued to any volume replicated on that node. If this occurs, the remediation is to either fail the node in place or contact Lightbits Support, who can perform an internal procedure to recover the node from this state.
40068 In rare cases, a newly created volume could be assigned the same NSID as an existing volume. This condition can lead to incorrect delete or update operations for volumes sharing the same NSID. If this issue is encountered, contact Lightbits Support for a manual remediation procedure to identify and fix the affected volumes.
39628 To prevent a rare Machine Check Exception (MCE) and forced reboots on Sapphire Rapids machines, we recommend disabling the DSA offload feature. This condition can occur if the duroslight log indicates "Enabling DSA crc32 offload for reads", and can be prevented by adding dsa_read_crc32: false and dsa_write_crc32: false under the "configurator" section of /etc/duroslight/conf.yaml.
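The check and the configuration change described above can be sketched as follows. This is an illustrative helper, not a Lightbits tool: the function name and the sample log line are made up, and the exact nesting of the two settings inside conf.yaml is an assumption based on the wording "under the configurator section".

```python
# Log message quoted in issue 39628.
DSA_MARKER = "Enabling DSA crc32 offload for reads"

# Settings named in issue 39628, shown nested under "configurator".
# The exact indentation inside conf.yaml is an assumption.
DSA_DISABLE_FRAGMENT = """\
configurator:
  dsa_read_crc32: false
  dsa_write_crc32: false
"""


def dsa_offload_enabled(log_lines):
    """Return True if any log line reports that DSA crc32 offload is enabled."""
    return any(DSA_MARKER in line for line in log_lines)


# Made-up sample line; real duroslight log lines will carry other fields.
sample_log = ["duroslight: Enabling DSA crc32 offload for reads"]
if dsa_offload_enabled(sample_log):
    print("Add to /etc/duroslight/conf.yaml:")
    print(DSA_DISABLE_FRAGMENT)
```

If conf.yaml already has a "configurator" section, the two keys would be merged into it rather than added as a second section.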
39211 When deleting the most recent snapshot of a volume while a node holding a replica is offline, recently written data could revert to the data stored in that snapshot if the node later becomes the primary.
38754 The node-manager service will fail to shut down gracefully if the shutdown is issued before it has finished powering up.
38497 When a new server is created to replace another server in the cluster using the --extend-cluster=false flag (the default setting), and that server and its node later experience a permanent failure while fail-in-place is enabled (causing the server's resources to be migrated), the server might not participate properly in the distribution of replicas across the cluster if it becomes active again. This could cause an imbalance of resources.
37831 In some cases, silent data corruption on an SSD could cause a node crash instead of attempting to recover the data and reporting an event. This can occur if the SSD returns invalid data rather than an I/O error.
37830 In extremely rare cases, a node may not recover to an active state if an I/O error or bad block is encountered on an underlying SSD during its startup sequence. This prevents a key service (gftl) from initializing correctly and may require manual intervention (such as removing the failed SSD from the system) to allow the node to complete its recovery successfully.
37505 In a rare combination of events, the 'physicalOwnedCapacity' volume statistic may report an incorrect value when data at a specific LBA is overwritten with content that has a different compression ratio. In this scenario, the updated length of the overwritten data is not correctly reflected in the statistic.
37395 In rare race conditions, a server may remain stuck in a deleting state.
37205 Incorrect handling of I/O errors from NVMe SSDs during abrupt recovery may cause node recovery to fail.
37114 In releases 3.13.1 and 3.14.1, the node-manager could fail to start, causing a Lightbits node to fail to come up. This can occur when the following conditions are all true:

  1. A node is configured such that Lightbits will use zero SSD drives in NUMA 1.
  2. Lightbits is configured to use one or more SSD drives in NUMA 2.
  3. Lightbits is configured to use a single instance across multiple NUMA nodes.
  4. The allowCrossNuma flag is set to false.

If the node-manager fails to start due to the above, contact Lightbits Support.
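The four conditions above combine with a logical AND, which the following sketch expresses as a single predicate. The function and parameter names are hypothetical, and how the per-NUMA SSD counts and the two settings are read from a node's configuration is not shown here.

```python
def node_manager_start_at_risk(ssds_per_numa, single_instance_across_numa, allow_cross_numa):
    """Return True when all four conditions listed for issue 37114 hold.

    ssds_per_numa maps a NUMA id (as numbered in the issue text) to the number
    of SSD drives Lightbits is configured to use there.
    """
    return bool(
        ssds_per_numa.get(1, 0) == 0        # condition 1: zero SSDs in NUMA 1
        and ssds_per_numa.get(2, 0) >= 1    # condition 2: one or more SSDs in NUMA 2
        and single_instance_across_numa     # condition 3: single instance across NUMA nodes
        and not allow_cross_numa            # condition 4: allowCrossNuma is false
    )


print(node_manager_start_at_risk({1: 0, 2: 4}, True, False))  # True
print(node_manager_start_at_risk({1: 2, 2: 4}, True, False))  # False
```

A node for which the predicate is false on any one condition is outside the scope of this issue as described.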
36882 The GFTL service could fail locally due to a rare race condition when an SSD failure/removal, an SSD read submission, and multiple volume rebuilds all occur at exactly the same time.
36722 Device matchers do not support specifying an exact NVMe device by path name, because this attribute is not persistent across reboots. Matching on it could lead to data loss, since the wrong device may be used.
36515 Due to changes in the OpenSSL version used in the Lightbits front end, read I/O throughput on encrypted data is degraded by ~30% in version 3.13.1.
36282
  1. When running the getHost API through REST, in certain cases getHost might return 'not found' even if the host exists. It is recommended to use the listHosts API with the hostNQN filter to get the proper results.
  2. Two initiators connecting with the same hostNQN override each other in listHosts, so only one of them is shown.
  3. Renaming a hostNQN requires running 'rm /etc/discovery-client/internal/internal.json'.
33865 In certain cases when migrating volumes during dynamic rebalancing, a VolumeInDegradedProtectionState event could be sent out when the volume is actually fully protected.
29683 Systems with Solidigm/Intel D5-P5316 drives may experience higher than expected write latency after several drive write cycles. Contact Lightbits Support if you use Solidigm/Intel D5-P5316 SSDs and are experiencing higher than expected write latency.
28027

In specific circumstances, the server upgrade status may not reflect the correct state after a successful retry. This occurs in the following sequence:

  1. A server is upgraded to version x.y.z.
  2. The operation fails (for example, due to a timeout); however, the server binaries are successfully updated to version x.y.z.
  3. The upgrade is retried to version x.y.z — this step is skipped internally, as the binaries are already up to date.
  4. The upgrade status continues to show the previous failed operation, even though the retry completed without error.
25382 Under the conditions below, the amount of storage occupied by cold units (filled with 4096 small objects) is not accounted for and not reported, which could result in reaching a storage full or almost full situation that is not observable in the node storage statistics:

  - A sufficient amount of logical user storage contains highly compressible data; e.g., zeroes.
  - This data has been written in large chunks over a short period of time.
  - During this time, there are no or almost no user writes with lower compression rates or to uncompressed volumes.
  - The highly compressed data written remains unmodified (cold); i.e., not overwritten by user writes for a long period of time.

When such a situation occurs, the control plane software does not detect storage capacity reaching the threshold to start proactive rebalancing to free capacity. The System Administrator also relies on the same storage statistics the control plane exposes, and therefore cannot tell that the system capacity has reached the limit.
22582 A server could remain in "Enabling" state if the enable server command is issued during an upgrade.
19670 The compression ratio returned by get-cluster API will be incorrect when the cluster has snapshots created over volumes. The calculation of the compression ratio at the cluster level uses different logic for physical used capacity and the amount of uncompressed data written to storage. Hence the compression ratio value might be higher than the actual value. A correct indication of cluster level compression can be deduced from a weighted average of compression ratio at the node levels; i.e., Compression ratio = sum(node compression ratio * node physical usage) / sum(node physical usage).
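The weighted average given for issue 19670 can be computed as in the sketch below. The function name and the input shape (a list of ratio/usage pairs) are assumptions for illustration; collecting the per-node values from the node statistics is not shown.

```python
def cluster_compression_ratio(nodes):
    """Usage-weighted average of per-node compression ratios.

    `nodes` is a list of (node_compression_ratio, node_physical_usage) pairs,
    with physical usage in any unit, as long as it is the same for all nodes.
    """
    total_usage = sum(usage for _, usage in nodes)
    if total_usage == 0:
        raise ValueError("no physical usage reported by any node")
    return sum(ratio * usage for ratio, usage in nodes) / total_usage


# Two nodes: ratio 2.0 with 100 units used, ratio 4.0 with 300 units used.
# (2.0 * 100 + 4.0 * 300) / (100 + 300) = 3.5
print(cluster_compression_ratio([(2.0, 100), (4.0, 300)]))  # 3.5
```

Nodes with more physical usage pull the result toward their own ratio, which is why a plain average of the node ratios (3.0 in this example) would not match.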
18966 "lbcli list events" could fail with "received message larger than max" when there are events that contain a large amount of information. Workaround: Use the --limit and --since arguments to read a smaller amount of data at a time.
18948 The node local rebuild progress (due to SSD failure) shows 100% done when there is no storage space left to complete the rebuild.
18522 When an attempt to add a server to a cluster using lbcli 'create server' or REST POST '/api/v2/servers' fails for any reason, 'list servers' could permanently show the new server in the 'creating' state.
18214 Automatic rebalancing features (fail-in-place and proactive-rebalance) should be disabled if enable_iptables is enabled during installation.
17298 The migration of volumes due to automatic rebalancing could take time, even when volumes are empty.
15715 During a volume rebuild, the Grafana dashboard does not show the write IOs for the recovered data.
14995 A single server cluster cannot be upgraded using the cluster upgrade command. Upgrade using only the upgrade server command.
14889 In case of an SSD failure, the system will scan the storage and rebuild the data. The entire raw capacity will be scanned, even when not all of it was utilized. This leads to a longer rebuild time than necessary.
14863 Prior to lb CSI installation, the lb discovery client service must be installed and started on all K8S cluster nodes.
14212 OpenStack: Once a volume attach fails, the following attempts to attach it will also fail. Workaround: Remove the discovery-client configuration files for the failed volume and restart the discovery-client and Nova services.
13680 In a cluster deployed with a minimum of two replicas, when more than one node fails, a three-replica volume may stay in read-only mode after its rebuild completes if another node returns to the active state at the same time.
13253 A local rebuild takes the same amount of time regardless of storage utilization.
13064 Following a 'replace node' operation, volumes with a single replica will be created as 'unavailable' in the new node. Note: Single replica volumes are not protected, and data will not move to the new node. Workaround: Delete single replica volumes before replacing the node, or reboot the new server after replacing the node.
12310 After a volume becomes unavailable due to failure of all replicas, it could take more than one replica to recover before the volume can be available again.
11856 Volume and node usage metrics might show different values between REST/lbcli and Prometheus, when a volume is deleted and a node is disconnected.
11326 Volume metrics do not return any value for volumes that are created but do not store any data.
10021 Commands affecting SSD content (such as blkdiscard, nvme format) should not be executed on the Lightbits server.