Known Issues in Lightbits 3.0.1
ID | Description |
---|---|
34970 | api-service could become unresponsive if it loses connectivity to etcd during startup. Workaround: Restart api-service using systemctl restart api-service. |
29683 | Systems with Solidigm/Intel D5-P5316 drives may experience higher-than-expected write latency after several drive write cycles. Contact Lightbits Support if you use Solidigm/Intel D5-P5316 SSDs and are experiencing this behavior. |
22995 | discovery-client init-connection does not specify a timeout. If a discovery service on one of the targets fails, the discovery-client might not abort the faulty discovery-service connection and will not fail over to another functioning service. |
22994 | The discovery-client does not time out on submission/completion of NVMe commands if the discovery-service never responds to them. |
22993 | The discovery-client fails to update referrals and cache entries when it receives a list of bad entries. |
22582 | A server could remain in the "Enabling" state if the "enable server" command is issued during an upgrade. |
22341 | The "create credential" command does not work via lbcli in this release. It can only be invoked through the REST API. |
22298 | A node could fail to power up if it holds very old cold data. In this case, you will need to upgrade to the latest Lightbits release. |
21415 | The QoS policy feature is currently in beta. In some cases, setting a bandwidth limit may result in an actual limit that is much lower than configured. |
19670 | The compression ratio returned by the get-cluster API will be incorrect when the cluster has snapshots created over volumes. The cluster-level compression ratio calculation uses different logic for the physical used capacity and for the amount of uncompressed data written to storage, so the reported value might be higher than the actual value. A correct indication of cluster-level compression can be derived from a weighted average of the per-node compression ratios; i.e., compression ratio = sum(node compression ratio * node physical usage) / sum(node physical usage). See the sketch following this table. |
19181 | When running "list events" with "limit" together with "until" or "since", "limit" takes precedence over "since" or "until", so potentially fewer events are returned than expected. To see more events, either do not use "limit" or set it to a higher value. |
18966 | "lbcli list events" could fail with "received message larger than max" when there are events that contain a large amount of information. Workaround: Use the --limit and --since arguments to read a smaller amount of data at a time. |
18948 | The node-local rebuild progress (due to an SSD failure) shows 100% done even when there is no storage space left to complete the rebuild. |
18771 | When a Lightbits node is considered to be in a "permanent failure" state, Lightbits considers the cluster to be smaller. This affects the maximum allowed replica count for new volumes. For example, in a three-node cluster with one permanently failed node, users will not be able to create a three-replica volume. |
18522 | When attempting to add a server to a cluster using lbcli "create server" or a REST POST to "/api/v2/servers", if the operation fails for any reason, "list servers" could permanently show the new server in the "creating" state. |
18470 | Prior to upgrading from Lightbits 2.2.x to this release, the dynamic capacity rebalancing features must be explicitly disabled. Using lbcli, execute the following commands: lbcli disable feature-flag fail-in-place; lbcli disable feature-flag proactive-rebalance. |
18214 | Automatic rebalancing features (fail-in-place and proactive-rebalance) should be disabled if enable_iptables is enabled during installation. |
17398 | Metrics scraping in Lightbits may take more than 10 seconds, depending on the number of volumes and snapshots. Increase the default scrape_timeout and scrape_interval if metrics are not being collected by Prometheus. |
17329 | Lightbits exposes latency information per request size. The time window for latency measurement is not synchronized with the measurement of the number of read/write requests. Therefore, a weighted-average calculation of latency over all request sizes will result in inaccurate latency information. |
17298 | The migration of volumes due to automatic rebalancing can take a long time, even when the volumes are empty. |
15880 | The core configuration for Samsung SSDs requires a manual override. The number of cores allocated to the GFTL reader should be limited to four. |
15715 | During a volume rebuild, the Grafana dashboard does not show the write IOs for the recovered data. |
15496 | When using Intel® Ethernet Controller E810, stopping the node-manager service (e.g. systemctl stop node-manager) requires running "rmmod ice; modprobe ice" before restarting the node-manager service. |
15037 | With the IP Tables feature enabled, adding a new node requires opening the etcd ports for that node using the "lbcli create admin-endpoint" command. |
14995 | A single-server cluster cannot be upgraded using the API. To upgrade, manually log into the server, stop the Lightbits services, run yum update, and reboot. |
14889 | In case of an SSD failure, the system scans the storage and rebuilds the data. The entire raw capacity is scanned, even when not all of it is utilized, which leads to a longer rebuild time than necessary. |
14863 | Prior to lb-csi installation, the discovery-client service must be installed and started on all Kubernetes (K8s) cluster nodes. |
14787 | Lightbits installation will fail on systems with NVDIMMs that do not support auto labels. Workaround: Log into the server and issue the following command: ndctl create-namespace -f -e namespace0.0 --type=pmem --mode=dax --no-autolabel |
14212 | OpenStack: Once a volume attach fails, subsequent attempts to attach will also fail. Workaround: Remove the discovery-client configuration files for the failed volume, and restart the discovery-client and Nova services. |
13680 | In a cluster deployed with a minimum of two replicas, when more than one node fails, a three-replica volume may remain in read-only mode after its rebuild completes if another node returns to the active state at the same time. |
13434 | Invoking rmmod nvme-fabrics before stopping the discovery-client service could cause a kernel panic on the client. |
13253 | A local rebuild takes the same amount of time regardless of storage utilization. |
13147 | If another server fails during a "disable server" flow, the "disable server" operation might not complete, and the server will become active again. In this case, issue the "disable server" command again. |
13064 | Following a "replace node" operation, volumes with a single replica will be created as "unavailable" on the new node. Note: Single-replica volumes are not protected, and data will not move to the new node. Workaround: Delete single-replica volumes before replacing the node, or reboot the new server after replacing the node. |
12950 | When nodes are configured with allowCrossNumaDevices=false, adding an nvme-device from a NUMA node that does not match the logical node's instance ID will not work. To determine which NUMA node the NVMe device is connected to, check the numaNodeID field returned by lbcli get nvme-device. To determine the logical node's instance ID, check the suffix of the node's name in the lbcli list nodes output. See the sketch following this table. |
12310 | After a volume becomes unavailable due to the failure of all of its replicas, more than one replica may need to recover before the volume becomes available again. |
11856 | Volume and node usage metrics might show different values between REST/lbcli and Prometheus when a volume is deleted and a node is disconnected. |
11565 | In some cases during Lightbits power-up, if an NVMe device is unexpectedly reset, Lightbits may fail to load and crash. A server reboot is required to recover. |
11326 | Volume metrics do not return any value for volumes that are created but do not store any data. |
10763 | Resource lists (servers, nodes, volumes, nvme-devices) can only be filtered by a single field (e.g., nvme-devices can be listed by node UUID, but not by both node UUID and server UUID). |
10346 | Many SSD removals during SSD resets can lead to a memory leak and require a server reboot. |
10021 | Commands affecting SSD content (such as blkdiscard, nvme format) should not be executed on the Lightbits server. |
9219 | If the Lightbits server unexpectedly enters an inactive state, a manual reboot is required. |
6734 | Compression should not be enabled on AMD EPYC based systems. |
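
For issue 19670, the following is a minimal sketch of the weighted-average calculation described in the table. The per-node values below are hypothetical placeholders; in practice, each node's compression ratio and physical usage would come from the per-node statistics that Lightbits reports (for example, via lbcli or the REST API), where the exact field names may differ.

```python
# Cluster-level compression ratio as a weighted average of per-node values
# (issue 19670). The input values below are hypothetical placeholders.

# (compression_ratio, physical_usage_bytes) per node -- example values only.
nodes = [
    (2.1, 3_500_000_000_000),
    (1.8, 4_200_000_000_000),
    (2.4, 2_900_000_000_000),
]

def cluster_compression_ratio(nodes):
    """Compression ratio = sum(node ratio * node physical usage) / sum(node physical usage)."""
    total_usage = sum(usage for _, usage in nodes)
    if total_usage == 0:
        return 0.0  # no physical usage yet; the ratio is undefined
    weighted = sum(ratio * usage for ratio, usage in nodes)
    return weighted / total_usage

print(f"cluster compression ratio: {cluster_compression_ratio(nodes):.2f}")
```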
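
For issue 12950, the sketch below illustrates the check described in the table. The values are hypothetical placeholders that would be read from the lbcli get nvme-device and lbcli list nodes output, and the node-name format used to extract the instance ID is an assumption for illustration only.

```python
# NUMA-affinity check described in issue 12950. Both values below are
# hypothetical: numa_node_id would come from the numaNodeID field of
# "lbcli get nvme-device", and node_name from "lbcli list nodes".

numa_node_id = 1                  # NUMA node the NVMe device is connected to
node_name = "rack01-server02-1"   # logical node name; suffix is the instance ID
                                  # (assumed "<server>-<instance>" format)

instance_id = int(node_name.rsplit("-", 1)[-1])

if numa_node_id == instance_id:
    print("Device matches the node's NUMA affinity; it can be added.")
else:
    print("Cross-NUMA device: adding it will fail while allowCrossNumaDevices=false.")
```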