Known Issues in Lightbits 3.15.1

AI Tools

ID	Description
44673	In rare cases, performing a KEK rotation while simultaneously executing a high volume of volume/snapshot control plane operations could result in increased etcd contention. This affects only customers utilizing cluster-level encryption with the KEK rotation API.
44435	pmem_init and lbe executables are linked using staticx and therefore require /tmp to be mounted with the exec option enabled; otherwise, these executables will fail to run with "permission denied".
44383	When encryption is enabled, deleting a server that hosts the Active CM could prevent subsequent server additions from completing successfully. To avoid this, stop the CM service on the removed server (or shut down the server) after it has been removed from the cluster.
42614	Cluster manager and etcd services could suffer a very slow potential memory leak. Mishandling of a deprecated GFTL data loss event could cause the event clean logic to stop cleanup of old events, leading to a continuous increase in the number of events stored on the cluster.
42309	In certain situations, if there is CM failover during the initial KEK rotation process (race condition), the new CM may not be able to become active. This means that many APIs will fail. The data path will still work as long as all nodes are healthy.
42282	A volume protection state might be reported incorrectly in API as fully protected instead of degraded or read only, following permanent failure re-balance that fails. The issue is limited to the protection state the API reports, but internally the protection state is handled as expected
41873	Under a specific race condition, if a snapshot is created while a node is down and then deleted during the node’s startup at a very precise timing, the node could become unavailable.
41466	Creating a snapshot with a retention time that exceeds 192 years will fail, and a restart of the api-service.
41162	Deleting a snapshot while a node is inactive could cause a subsequent rebuild initiated from that node (acting as primary) to fail. This condition can occur when the inactive node retains metadata for the deleted snapshot while peer nodes do not. Full (migration) rebuilds are more likely to be impacted, as they could include objects associated with the affected snapshot. If this issue is encountered, contact Lightbits Support for an approved procedure to identify and release the problematic snapshots.
41095	The NodeRebuildNotPossible alert is not triggered as expected.
40883	In the specific case of using VCP to upgrade a cluster, the upgrade to Lightbits version 3.17.1 or higher will fail because VCP cannot parse the new version format. To successfully upgrade the cluster, use the Lightbits core CLI or REST API directly.
40293	When an admin-endpoint is deleted or updated, the corresponding iptables rules created for it remain in place. As a result, the related ports stay open even though the admin-endpoint has been deleted or updated. The iptables configuration is refreshed only after a service restart, instead of being properly updated in real time.
40068	In rare cases, a newly-created volume could be assigned the same NSID as an existing volume. This condition can lead to incorrect delete or update operations for volumes sharing the same NSID. If this issue is encountered, contact Lightbits Support for a manual remediation procedure to identify and fix the affected volumes.
39742	Volumes protection state may fail to be updated correctly, in certain scenarios due to an internal race condition that could lead to very temporary resource inconsistency that will fail protection state update.
39628	To prevent a rare potential Machine Check Exception (MCE) and forced reboots on Sapphire Rapids machines, we recommend disabling the DSA offload feature. This condition can occur if the duroslight log indicates "Enabling DSA crc32 offload for reads," and can be prevented by adding dsa_read_ crc32: false and dsa_write_crc32: false under the "configurator" section of /etc/duroslight/conf.yaml.
39211	When deleting the most recent snapshot of a volume while a node holding a replica is offline, recently written data could revert to the data stored in that snapshot if the node later becomes the primary.
38497	When creating a new server to replace another server in the cluster (using the --extend-cluster=false flag - note that this is also the default), the new server will not participate in all proper distribution of replicas over the cluster and could cause an imbalance of resources.
38043	For 3.14 & 3.15.1 If encryption was turned on but enabling it failed - resulting in the creation of an 'EnableServerEncryptionFailed' event - the API service will return stale events. Any event from that point onward that exists in the system will not be returned by the "ListEvents" API. As a workaround, check if this event exists before upgrading to 3.14/3.15.1. Note that a similar issue could also occur when a cluster has double disk failure on one of the servers (or single disk failure with no EC), and Lightbits 3.2.x or older was used at the time of failure.
37852	List connected hosts could return hosts that are not connected to a volume when a volume uses IP-ACL. Listing the connected hosts with a volume filter could return a host that is connected to the nodes the volume is replicated to (if other volumes exist that have ACL or IP-ACL matching this host, running over the same nodes) - even if this volume's IP-ACL does not match this host.
37831	In some cases, silent data corruption on an SSD could cause a node crash instead of attempting to recover the data and reporting an event. This can occur if the SSD returns invalid data rather than an I/O error.
37830	In a very rare case, a node could fail to recover and return to an active state if an I/O error or bad block is encountered on an underlying SSD during its startup sequence. This issue prevents a key service (gftl) from initializing correctly and could require manual intervention - such as the removal of the failed SSD from the system - to allow the node to successfully complete its recovery.
37738	Under certain scenarios, Lightbits will cause the grub package to be updated during Lightbits installation, including the addition of new servers. On RHEL8 and derivatives, after updating Grub from "grub2-2.02-162.el8_10" to "grub2-2.02-165.el8_ 10", if the system is using BIOS mode it might enter the "grub rescue>" prompt upon booting. When this happens, see https://access.redhat.com/solutions/7118853 for how to restore system boot to normal operation.
37505	The volume statistic 'physicalOwnedCapacity' might report an incorrect value when data is overwritten at the same LBA block with a different length. This can occur when the overwritten data is compressed with a different compression ratio than the original. In such cases, the length of the overwritten data is not accounted for in the statistic.
37187	In some rare cases duroslight process shutdown may hang and never reach `configurator - Duroslight stopped` causing systemd timeout in process shutdown.
29683	Systems with Solidigm/Intel D5-P5316 drives may experience higher than expected write latency after several drive write cycles. Contact Lightbits Support if you use Solidigm/Intel D5-P5316 SSDs and are experiencing higher than expected write latency.
28027	A server upgrade status will not update in the following sequence: A server is upgraded to release x.y.z. The operation fails (i.e., times out); however, binaries on the server are updated to version x.y.z. At a later time, the upgrade is attempted again to version x.y.z (this operation is skipped internally, as binaries have already been updated). The upgrade status will continue to show the failed upgrade operation, even though the last upgrade returned with no error.
25382	Under the conditions below, the amount of storage occupied by cold units (filled with 4096 small objects), is not accounted for and not reported, which could result in reaching a storage full or almost full situation that is not observable in the node storage statistics: A sufficient amount of logical user storage contains highly compressible data; e.g., zeroes. This data has been written in large chunks over a short period of time. During this time, no or almost no user writes with lower compression rates or to uncompressed volumes. The highly compressed data written remains unmodified (cold); i.e., not overwritten by user writes for a long period of time. When such a situation occurs, the control plane software does not detect storage capacity reaching the threshold to start proactive rebalancing to free capacity. The System Administrator also relies on the same storage statistics the control plane exposes, and therefore cannot tell that the system capacity has reached the limit.
22582	A server could remain in "Enabling" state if the enable server command is issued during an upgrade.
19670	The compression ratio returned by get-cluster API will be incorrect when the cluster has snapshots created over volumes. The calculation of the compression ratio at the cluster level uses different logic for physical used capacity and the amount of uncompressed data written to storage. Hence the compression ratio value might be higher than the actual value. A correct indication of cluster level compression can be deduced from a weighted average of compression ratio at the node levels; i.e., Compression ratio = sum(node compression ratio * node physical usage) / sum(node physical usage).
18966	"lbcli list events" could fail with "received message larger than max" when there are events that contain a large amount of information. Workaround: Use the --limit and --since arguments to read a smaller amount of data at a time.
18948	The node local rebuild progress (due to SSD failure) shows 100% done when there is no storage space left to complete the rebuild.
18522	When attempting to add a server to a cluster using lbcli 'create server' or rest post '/api/v2/servers", and the operation fails for any reason, 'list servers' could permanently show the new server in 'creating' state.
18214	Automatic rebalancing features (fail-in-place and proactive-rebalance) should be disabled if enable_iptables is enabled during installation.
17298	The migration of volumes due to automatic rebalancing could take time, even when volumes are empty.
15715	During a volume rebuild, the Grafana dashboard does not show the write IOs for the recovered data.
15037	With the IP Tables feature enabled, adding a new node requires opening the etcd ports for that node using the "lbcli create admin-endpoint" command.
14995	A single server cluster cannot be upgraded using the API. In order to upgrade, manually log into the server, stop the Lightbits services, run a yum update, and reboot.
14889	In case of an SSD failure, the system will scan the storage and rebuild the data. The entire raw capacity will be scanned, even when not all of it was utilized. This leads to a longer rebuild time than necessary.
14863	Prior to lb CSI installation, the lb discovery client service must be installed and started on all K8S cluster nodes.
14212	OpenStack: Once a volume attach fails, the following attempts to attach it will also fail. Workaround: Remove the discovery-client configuration files for the failed volume and restart the discovery-client and Nova services.
13064	Following a 'replace node' operation, volumes with a single replica will be created as 'unavailable' in the new node. Note: Single replica volumes are not protected, and data will not move to the new node. Workaround: Delete single replica volumes before replacing the node, or reboot the new server after replacing the node.
11856	Volume and node usage metrics might show different values between REST/lbcli and Prometheus, when a volume is deleted and a node is disconnected.
11326	Volume metrics do not return any value for volumes that are created but do not store any data.
10021	Commands affecting SSD content (such as blkdiscard, nvme format) should not be executed on the Lightbits server.

Last updated on

Was this page helpful?