Release 3.19.1
Release Date
v3.19.1 was released to the public on May 18, 2026.
New in This Release
This release introduces the following changes since version 3.18.x. Each change is classified as a new feature, an enhancement, a major issue (e.g., an issue that could lead to data loss or service loss), or a minor issue.
| Issue Type | Description | ID |
|---|---|---|
| Minor fix in lb-support, which now by default includes the server hostname in the name of the tar file it creates. | LBM1-42579 | |
| Added a config option, maxRestartAttempts, that on failure only restarts DL. This feature is disabled by default. | LBM1-41473 | |
| Added Prometheus alerts for cluster resource limits (volumes, snapshots, connected hosts, SSDs), with a warning threshold at 70% and a critical threshold at 100%. | LBM1-31689 | |
| Added two new panels to the Grafana dashboard to display the number of unconnected nodes and the number of hosts impacted by connectivity issues. Unconnected nodes are nodes that a client (host) is expected to be connected to but is not. Connectivity-impacted hosts are the clients (hosts) affected by these unconnected nodes. A client is expected to be connected to all nodes that have at least one replica of a volume whose ACL contains the client’s host NQN. | LBM1-41407 | |
| Added the ability to increase the reserved RAM by setting an optional, additional custom reserved-RAM value. This can be used when the default 8 GiB of RAM reserved for the OS and non-Lightbits services is insufficient and additional RAM should be reserved. The total amount of reserved RAM is capped at 15% of the server memory (or the existing default: 21 GiB x number of LB instances). | LBM1-40362 | |
| Added ctrlLossTMO to discovery-client.yaml (int, seconds, default is 600): older versions used -1, i.e., no timeout. The value is used in the nvme connect command --ctrl-loss-tmo parameter. Any controller that cannot reconnect within this time will be automatically removed by the nvme subsystem, thus preventing old controllers corresponding to disabled or deleted servers from reconnecting forever. The cluster discovery service notifies relevant discovery-clients when a node becomes active, so if a node is down and the corresponding controller is removed after 600s, discovery-client will reconnect it once the node becomes active. Note: This will not take effect for existing controllers connected by an older discovery-client. These can be adjusted by "echo 600 > /sys/class/nvme/nvme0/ctrl_loss_tmo". Important: If a volume becomes unavailable (e.g., a 1x volume with a node down) for more than ctrlLossTMO, the client block device will disappear, and could reappear as a different block device when the volume becomes available again. This may require manual action, especially if the volume was mounted. | LBM1-37762 | |
| Added documentation for running discovery-client as a container in an OpenStack environment. | LBM1-44419 | |
| Added support for log streaming. The target and sources for streaming logs can be configured via the official Lightbits API/CLI (note that this feature is in tech preview). For sustained logging with this tech preview release, it is recommended to use log-level "Info" or higher (Error, Warning). | LBM1-44321 | |
| Added an extra panel in volume-performance, and extended volumes status in cluster-tab, reflecting volumes in the Migrating state. | LBM1-36619 | |
| Added recheck intervals for node progress: sampling continues as long as there is progress; otherwise, the node is failed. | LBM1-42474 | |
| AMD Dual Socket Support: All previous limitations related to AMD dual-socket support have been lifted in this release. | LBM1-43382 | |
| Added an option to run discovery-client as a Docker image. | LBM1-42400 | |
| Enabled encrypted clones from unencrypted base snapshots, allowing derived volumes to use unique keys. Supported on clusters configured with encryption. | LBM1-42408 | |
| Enhanced the Connected-hosts and Volumes APIs to track and expose hosts that are not connected to some volume replica(s), based on host connectivity and ACL values. Updated lbcli with new status columns that provide real-time visibility into missing node paths and volume replica accessibility. | LBM1-40811 | |
| Fixed Lightbits cluster status to correctly reflect installation completion in OpenShift console (in the CRD), for a Lightbits cluster custom resource. | LBM1-40401 | |
| Fixed nil errors in volumeapi and data_layer/tasks to ensure correct service startup. | LBM1-43621 | |
| Fixed runtime nil errors in lightbox-exporter and profile-generator for accurate telemetry collection. | LBM1-43625 | |
| Fixed unhandled errors in API core request handlers to improve reliability. | LBM1-43619 | |
| Fixed unhandled errors in cluster-manager startup, resolving two nilerr and two errorlint findings. | LBM1-43618 | |
| Fixed unhandled linting errors (errcheck, nilerr, errorlint) in shared packages and SDK/CLI. | LBM1-43622 | |
| Fixed nilerr issues during subsystem initialization at process startup, preventing early failures. | LBM1-43620 | |
| los-csi: Added support for K8s version v1.35.1. | LBM1-42361 | |
| Node manager: Improved error handling robustness. | LBM1-43617 | |
| node-manager: If a journaling device fails on a single node in a dual-node server, only that node instance is restarted, rather than the entire node-manager service. | LBM1-39359 | |
| NVMe error log alerts and metrics, with configurable DNR filtering. A new collector and provider were added, shelling out to nvme error-log --output-format=json and exposing two Prometheus metrics: lightbox_nvmeerrorlog_entries_total (sum of the error_count field across all log entries sharing the same (device, sct, sc, dnr) key) and lightbox_nvmeerrorlog_last_entry_count (the maximum error_count among those entries), both labeled with {device, sct, sc, error_type, dnr}. The collector has a dnr_only config param (default "true"). When "true", only entries with the DNR (Do Not Retry) bit set are emitted. Setting it to "false" exposes all non-empty log entries, with the dnr label present on every metric so consumers can still filter to DNR-only in PromQL. Eight DNR=true alert rules were added (NVmeUnrecoveredReadError, NVmeWriteFault, NVmeE2EIntegrityError, NVmeAccessDenied, NVmeInternalDeviceError, NVmeLBAOutOfRange, NVmeMultipleErrorTypes, NVmeErrorLogEntryDetected), plus five DNR=false "retryable" counterparts (NVmeRepeatedRetryableErrors, NVmeInternalDeviceErrorRetryable, NVmeUnrecoveredReadErrorRetryable, NVmeWriteFaultRetryable, NVmeE2EIntegrityErrorRetryable), which only fire when the exporter is deployed with dnr_only: "false". | LBM1-42478 | |
| SMART alerts, previously-unreported metrics, and dynamic per-device thresholds. Two new temperature threshold metrics were added to the SMART collector: lightbox_smart_warning_comp_temperature_threshold_celsius and lightbox_smart_critical_comp_temperature_threshold_celsius, parsed from the same smartctl -A -i -H invocation (no extra process) using the keys warning__comp__temperature_threshold / critical_comp__temperature_threshold. Two metrics that were already implemented in code but missing from documentation were also surfaced: lightbox_smart_device_info (with 9 labels including model, serial, firmware) and lightbox_smart_device_smart_healthy (0=unhealthy, 1=healthy). The new alert rules use these threshold metrics dynamically - e.g., SmartTemperatureWarning fires when temperature exceeds lightbox_smart_warning_comp_temperature_threshold_celsius - 5, with an or fallback to 70°C for devices that don't report the threshold, and similarly SmartTemperatureCritical falls back to 80°C. | LBM1-42479 | |
| SSD alerts and metrics include the invariant labels serial-number, node-uuid, and server-uuid. | LBM1-43748 | |
| Updated los-csi protobufs to latest. | LBM1-41532 | |
| Updated the CSI driver to support spec 1.12, and also updated CSI sidecar images. | LBM1-19818 | |
| A race condition that could cause a deadlock during updates to a Protection Rule (PR) or volume protection state has been resolved. | LBM1-42800 | |
| Avoid switching from secondary to primary when the node is rebuilding. | LBM1-42314 | |
| Fixed a PG migration stall caused by snapshot deletion during createExistingSnapshots. To improve resiliency, the process now skips failed snapshots rather than aborting, ensuring that remaining node-snapshot keys are still written. | LBM1-41873 | |
| Fixed a rare race where CM could complete and delete a task after UM fetched this task key, but before the task was loaded. An error was returned, preventing loading of remaining tasks, and ultimately causing the server upgrade to fail since the server upgrade task was not loaded. | LBM1-42249 | |
| Increased monitoring-stack deployment timeout. | LBM1-44327 | |
| node-manager: Fixed a potential deadlock that could cause the node to hang. | LBM1-42974 | |
| Fixed a rare condition where a storage node could fail to start if placement group membership changed while the node was offline. Memory pressure during a prior recovery could leave stale metadata persisted due to internal counters not being reset; a subsequent restart would then fail a consistency check. The recovery fallback logic now fully resets all affected counters when a partial failure is detected. | LBM1-43456 | |
| Fixed a rare condition where the GFTL could crash if a volume delete command was received while a node was recovering and pre-existing volume logical stats were being updated as part of the recovery process. | LBM1-43379 | |
| A resiliency safeguard has been added to prevent the Node Manager (NM) from reassigning a device already designated as a data device for use as a journal device, further improving overall cluster robustness. | LBM1-44298 | |
| Added input validation for snapshot creation to reject retention time values exceeding 192 years, preventing an unintended API service restart. | LBM1-41466 | |
| cluster-manager: Fixed a race condition in the PG replacement flow that could place the same node UUID twice in the same PeerInfoList, causing all volumes in the PG to get stuck in degraded mode. This could happen when two members suffered permanent failure almost simultaneously. The fix introduces two safeguards to prevent this from happening. | LBM1-44273 | |
| cluster-manager: Fixed a rare condition where a deprecated event key in etcd could cause the event cleaner to exit, leading to event accumulation and a potential out-of-memory (OOM) condition during CM switchover. | LBM1-42614 | |
| cluster-manager: Fixed an issue that could cause a VolumeInDegradedProtectionState event to incorrectly indicate that migrating volumes are degraded during dynamic rebalancing, when they are in fact fully protected. | LBM1-33865 | |
| Cosmetic change: when lbcli fetches logs from all servers, errors are printed only once. | LBM1-42344 | |
| Fixed CSI driver status to correctly reflect installation completion in OpenShift console (in the CRD), for a Lightbits CSI driver custom resource. | LBM1-40403 | |
| Fix for install-lightos cleanup task, deleting server-config.yaml, which holds cluster endpoints. | LBM1-42366 | |
| Fixed systest: wait for etcd to be updated with cluster state. | LBM1-44013 | |
| Fixed a condition where, with cluster encryption enabled, a CM failover sequence coinciding with a precise time window during KEK rotation may prevent the Cluster Manager service from restarting successfully, blocking the processing of new API requests and the ability to react to cluster state changes. | LBM1-42309 | |
| Fixed a rare condition where Duroslight could hang for approximately five minutes during shutdown. Duroslight now cancels pending futures upon receiving the shutdown command. As a result, rebuild times upon recovery may be slightly longer. | LBM1-38706 | |
| Fixed a rare race between a snapshot deletion and a migration setup on the target node, which could leave a stale snapshot in the target node GFTL. This could cause a failure in a future rebuild/migration. | LBM1-42584 | |
| Fixed an issue where the system could reject adding a healthy NVMe device if the request was invoked while the server was rebuilding data after a different NVMe device had previously failed. | LBM1-43410 | |
| Fixed an incorrect "Node inactive: Connectivity Issue" event after a server disable operation. A different event is now issued to indicate that the node is inactive due to server disable. | LBM1-40182 | |
| Fixed journal device events that were missing a node field. | LBM1-42966 | |
| Fixed unit test. | LBM1-42490 | |
| Fixed the systemd collector so that its metrics appear in the Lightbits documentation, and reduced the amount of logs the collector produces. | LBM1-42172 | |
| Increased accuracy of the alert calculation logic and improved alert message firing for NodeRebuildNotPossible, to ensure that the alert triggers as expected. | LBM1-41095 | |
| Replaced the deprecated gcr.io Docker registry with quay.io to ensure uninterrupted functionality. | LBM1-43353 | |
| Fixed an issue where a network disconnect coinciding exactly with a state change of an NVMe SSD device could prevent correct updates of future state changes for that device. | LBM1-42991 | |
| Updated API to v1. | LBM1-40420 | |
| When a journaling NVMe device fails (e.g., is removed from an MDADM RAID0 array), and the node is later rebooted, the duroslight service fails to start because the device is still marked as failed in etcd. Previously, there was no way to recover from this state without manual etcd manipulation. The new ssd-recovery CLI tool allows resetting a failed journal device back to a healthy state. It validates that the device exists, is a journal device, is in a failed state, and has a failure timestamp before making any changes. | LBM1-42487 |
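
The aggregation described for the two NVMe error log metrics (LBM1-42478) can be illustrated with a small sketch. This is not the exporter's code: the function name and the dictionary shape of each log entry are assumptions made for illustration. Only the grouping key, the sum/maximum semantics, the skipping of empty entries, and the dnr_only behavior come from the release note.

```python
from collections import defaultdict


def aggregate_error_log(entries, dnr_only=True):
    """Group error-log entries by (device, sct, sc, dnr).

    `entries` is a hypothetical, already-parsed list of dicts with the keys
    device, sct, sc, dnr and error_count; the real JSON layout may differ.
    Returns two dicts keyed by (device, sct, sc, dnr):
      totals -> sum of error_count (cf. lightbox_nvmeerrorlog_entries_total)
      maxima -> max of error_count (cf. lightbox_nvmeerrorlog_last_entry_count)
    """
    totals = defaultdict(int)
    maxima = defaultdict(int)
    for entry in entries:
        if entry["error_count"] == 0:
            continue  # empty log slot, never exposed
        if dnr_only and not entry["dnr"]:
            continue  # retryable entry, hidden unless dnr_only is off
        key = (entry["device"], entry["sct"], entry["sc"], entry["dnr"])
        totals[key] += entry["error_count"]
        maxima[key] = max(maxima[key], entry["error_count"])
    return dict(totals), dict(maxima)
```

With dnr_only left at its default, retryable entries never reach the output, which is consistent with the note that the five "retryable" alert rules only fire when the exporter is deployed with dnr_only: "false".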
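
The ctrlLossTMO item (LBM1-37762) notes that controllers connected by an older discovery-client keep their previous timeout and can be adjusted by writing to the ctrl_loss_tmo sysfs attribute. A minimal sketch of applying that write to several controllers follows. The function name and the sysfs_root parameter are hypothetical; the sketch assumes root privileges and that every controller exposing the attribute should get the same value, which may not hold if the host also has non-Lightbits NVMe-oF controllers.

```python
from pathlib import Path


def set_ctrl_loss_tmo(seconds=600, sysfs_root="/sys/class/nvme"):
    """Write `seconds` to every <sysfs_root>/nvme*/ctrl_loss_tmo attribute.

    Controllers that do not expose the attribute are skipped by the glob.
    Returns the names of the controllers that were updated.
    """
    updated = []
    for attr in sorted(Path(sysfs_root).glob("nvme*/ctrl_loss_tmo")):
        # Same effect as: echo 600 > /sys/class/nvme/nvme0/ctrl_loss_tmo
        attr.write_text(f"{seconds}\n")
        updated.append(attr.parent.name)
    return updated
```

The sysfs_root parameter exists only so the function can be exercised against a scratch directory before it is pointed at the real /sys/class/nvme.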
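
The dynamic threshold used by SmartTemperatureWarning (LBM1-42479) reduces to simple arithmetic, sketched below. The shipped rule is a Prometheus alert expression, not Python; the function names are hypothetical, and the sketch covers only the warning rule, since the release note gives the 5-degree margin explicitly for that rule alone.

```python
def warning_limit_celsius(reported_threshold=None, margin=5.0, fallback=70.0):
    """Temperature above which the warning would fire.

    reported_threshold stands in for the device's
    lightbox_smart_warning_comp_temperature_threshold_celsius value, or None
    when the device does not report one (the fallback case).
    """
    if reported_threshold is None:
        return fallback
    return reported_threshold - margin


def smart_temperature_warning(temperature, reported_threshold=None):
    """True when the temperature exceeds the per-device warning limit."""
    return temperature > warning_limit_celsius(reported_threshold)
```

For example, a device reporting a warning threshold of 80°C would trigger the warning above 75°C, while a device that reports no threshold would trigger it above 70°C.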
Installation and Upgradeability
You can upgrade to this release from all previous Lightbits 3.16.x, 3.17.x, and 3.18.x releases.
© 2026 Lightbits Labs™