Release 3.19.1

Release Date

v3.19.1 was released to the public on May 18, 2026.

New in This Release

This release introduces the following changes since version 3.18.x. Each change is classified as a new feature, an enhancement, a major issue (e.g., an issue that could lead to data loss or service loss), or a minor issue.

Issue Type   Description   ID
Enhancement A minor fix in lb-support, which by default appends the server hostname as part of the tar file created. LBM1-42579
Enhancement Added a config option, maxRestartAttempts, that restarts only DL on failure. This feature is disabled by default. LBM1-41473
Enhancement Added Prometheus alerts for cluster resource limits (volumes, snapshots, connected hosts, SSDs), with a warning threshold at 70% and a critical threshold at 100%. LBM1-31689
Enhancement Added two new panels to the Grafana dashboard to display the number of unconnected nodes and the number of hosts impacted by connectivity issues. Unconnected nodes are nodes that a client (host) is expected to be connected to but is not. Connectivity-impacted hosts are the clients (hosts) affected by these unconnected nodes. A client is expected to be connected to all nodes that hold at least one replica of a volume whose ACL contains the client’s host NQN. LBM1-41407
Enhancement Added a new capability to increase the reserved RAM by setting an optional additional custom reserve RAM value. This can be used when the default 8 GiB of RAM reserved for the OS and non-Lightbits services does not suffice and additional RAM should be reserved. The total amount of reserved RAM is capped at 15% of the server memory (or the existing default: 21 GiB x number of LB instances). LBM1-40362
Enhancement Added ctrlLossTMO to discovery-client.yaml (int, seconds; default is 600). Older versions used -1, i.e., no timeout. The value is passed to the nvme connect command as the --ctrl-loss-tmo parameter. Any controller that cannot reconnect within this time is automatically removed by the nvme subsystem, which prevents old controllers corresponding to disabled or deleted servers from trying to reconnect forever. The cluster discovery service notifies relevant discovery-clients when a node becomes active, so if a node is down and the corresponding controller is removed after 600s, discovery-client will reconnect it once the node becomes active. Note: This does not take effect for existing controllers connected by an older discovery-client. These can be adjusted with "echo 600 > /sys/class/nvme/nvme0/ctrl_loss_tmo". Important: If a volume becomes unavailable (e.g., a 1x volume with a node down) for longer than ctrlLossTMO, the client block device will disappear, and could reappear as a different block device when the volume becomes available again. This may require manual action, especially if the volume was mounted. LBM1-37762
Enhancement Added documentation for running discovery-client as a container in an OpenStack environment. LBM1-44419
Enhancement Added support for log streaming (tech preview). The target and sources for streaming logs can be configured via the official Lightbits API/CLI. For sustained logging with this tech preview release, it is recommended to use log-level "Info" or higher (Error, Warning). LBM1-44321
Enhancement Added an extra panel in volume-performance, and extended volumes status in cluster-tab, reflecting volumes in the Migrating state. LBM1-36619
Enhancement Added recheck intervals for node progress: sampling continues as long as there is progress; otherwise, the node is failed. LBM1-42474
Enhancement AMD Dual Socket Support - All previous limitations related to AMD dual-socket support have been lifted in this release. LBM1-43382
Enhancement Added another option to use discovery-client as a Docker image. LBM1-42400
Enhancement Enabled encrypted clones from unencrypted base snapshots, allowing derived volumes to use unique keys. Supported on clusters configured with encryption. LBM1-42408
Enhancement Enhanced the Connected-hosts and Volumes APIs to track and expose hosts that are not connected to some volume replica(s), based on host connectivity and ACL values. Updated lbcli with new status columns that provide real-time visibility into missing node paths and volume replica accessibility. LBM1-40811
Enhancement Fixed the Lightbits cluster status to correctly reflect installation completion in the OpenShift console (in the CRD) for a Lightbits cluster custom resource. LBM1-40401
Enhancement Fixed nil errors in volumeapi and data_layer/tasks to ensure correct service startup. LBM1-43621
Enhancement Fixed runtime nil errors in lightbox-exporter and profile-generator for accurate telemetry collection. LBM1-43625
Enhancement Fixed unhandled errors in API core request handlers to improve reliability. LBM1-43619
Enhancement Fixed unhandled errors in cluster-manager startup, resolving two nilerr and two errorlint findings. LBM1-43618
Enhancement Fixed unhandled linting errors (errcheck, nilerr, errorlint) in shared packages and SDK/CLI. LBM1-43622
Enhancement Fixed nilerr issues during subsystem initialization at process startup, preventing early failures. LBM1-43620
Enhancement los-csi: Added support for K8s version v1.35.1. LBM1-42361
Enhancement node-manager: Improved error handling robustness. LBM1-43617
Enhancement node-manager: If a journaling device fails on a single node in a dual-node server, only that node instance is restarted, not the entire node-manager service. LBM1-39359
Enhancement NVMe error log alerts and metrics, with configurable DNR filtering. A new collector and provider were added, shelling out to nvme error-log --output-format=json and exposing two Prometheus metrics: lightbox_nvmeerrorlog_entries_total (sum of the error_count field across all log entries sharing the same (device, sct, sc, dnr) key) and lightbox_nvmeerrorlog_last_entry_count (the maximum error_count among those entries), both labeled with {device, sct, sc, error_type, dnr}. The collector has a dnr_only config param (default "true"). When "true", only entries with the DNR (Do Not Retry) bit set are emitted. Setting it to "false" exposes all non-empty log entries, with the dnr label present on every metric so consumers can still filter to DNR-only in PromQL. Eight DNR=true alert rules were added (NVmeUnrecoveredReadError, NVmeWriteFault, NVmeE2EIntegrityError, NVmeAccessDenied, NVmeInternalDeviceError, NVmeLBAOutOfRange, NVmeMultipleErrorTypes, NVmeErrorLogEntryDetected), plus five DNR=false "retryable" counterparts (NVmeRepeatedRetryableErrors, NVmeInternalDeviceErrorRetryable, NVmeUnrecoveredReadErrorRetryable, NVmeWriteFaultRetryable, NVmeE2EIntegrityErrorRetryable), which only fire when the exporter is deployed with dnr_only: "false". LBM1-42478
Enhancement SMART alerts, previously-unreported metrics, and dynamic per-device thresholds. Two new temperature threshold metrics were added to the SMART collector: lightbox_smart_warning_comp_temperature_threshold_celsius and lightbox_smart_critical_comp_temperature_threshold_celsius, parsed from the same smartctl -A -i -H invocation (no extra process) using the keys warning__comp__temperature_threshold / critical_comp__temperature_threshold. Two metrics that were already implemented in code but missing from documentation were also surfaced: lightbox_smart_device_info (with 9 labels including model, serial, firmware) and lightbox_smart_device_smart_healthy (0=unhealthy, 1=healthy). The new alert rules use these threshold metrics dynamically - e.g., SmartTemperatureWarning fires when temperature exceeds lightbox_smart_warning_comp_temperature_threshold_celsius - 5, with an or fallback to 70°C for devices that don't report the threshold, and similarly SmartTemperatureCritical falls back to 80°C. LBM1-42479
Enhancement SSD alerts and metrics now carry invariant labels: serial-number, node-uuid, and server-uuid. LBM1-43748
Enhancement Updated los-csi protobufs to latest. LBM1-41532
Enhancement Updated the CSI driver to support spec 1.12, and also updated CSI sidecar images. LBM1-19818
Major A race condition that could cause a deadlock during updates to a Protection Rule (PR) or volume protection state has been resolved. LBM1-42800
Major Switching from secondary to primary is now avoided while the node is rebuilding. LBM1-42314
Major Fixed a PG migration stall caused by snapshot deletion during createExistingSnapshots. To improve resiliency, the process now skips failed snapshots rather than aborting, ensuring that remaining node-snapshot keys are still written. LBM1-41873
Major Fixed a rare race where CM could complete and delete a task after UM fetched this task key, but before the task was loaded. An error was returned, preventing loading of remaining tasks, and ultimately causing the server upgrade to fail since the server upgrade task was not loaded. LBM1-42249
Major Increased monitoring-stack deployment timeout. LBM1-44327
Major node-manager: Fixed a potential deadlock that could cause the node to hang. LBM1-42974
Major Fixed a rare condition where a storage node could fail to start if placement group membership changed while the node was offline. Memory pressure during a prior recovery could leave stale metadata persisted due to internal counters not being reset; a subsequent restart would then fail a consistency check. The recovery fallback logic now fully resets all affected counters when a partial failure is detected. LBM1-43456
Major On rare occasions, the GFTL could crash when a volume delete command was received while a recovering node was updating pre-existing volume logical stats as part of the recovery process. LBM1-43379
Minor A resiliency safeguard has been added to prevent the Node Manager (NM) from reassigning a device already designated as a data device for use as a journal device, further improving overall cluster robustness. LBM1-44298
Minor Added input validation for snapshot creation to reject retention time values exceeding 192 years, preventing an unintended API service restart. LBM1-41466
Minor cluster-manager: Fixed a race condition in the PG replacement flow that could place the same node UUID twice in the same PeerInfoList, causing all volumes in the PG to get stuck in degraded mode. This could happen when two members suffered permanent failure almost simultaneously. The fix introduces two safeguards to prevent this from happening. LBM1-44273
Minor cluster-manager: Fixed a rare condition where a deprecated event key in etcd could cause the event cleaner to exit, leading to event accumulation and a potential out-of-memory (OOM) condition during CM switchover. LBM1-42614
Minor cluster-manager: Fixed an issue that could cause a VolumeInDegradedProtectionState event to incorrectly indicate that migrating volumes are degraded during dynamic rebalancing, when they are in fact fully protected. LBM1-33865
Minor Cosmetic change in lbcli fetch logs from all servers, to print errors only once. LBM1-42344
Minor Fixed the CSI driver status to correctly reflect installation completion in the OpenShift console (in the CRD) for a Lightbits CSI driver custom resource. LBM1-40403
Minor Fix for install-lightos cleanup task, deleting server-config.yaml, which holds cluster endpoints. LBM1-42366
Minor Fixed systest: wait for etcd to be updated with cluster state. LBM1-44013
Minor Fixed a condition where, with cluster encryption enabled, a CM failover sequence coinciding with a precise time window during KEK rotation may prevent the Cluster Manager service from restarting successfully, blocking the processing of new API requests and the ability to react to cluster state changes. LBM1-42309
Minor Fixed a rare condition where Duroslight could hang for approximately five minutes during shutdown. Duroslight now cancels pending futures upon receiving the shutdown command. As a result, rebuild times upon recovery may be slightly longer. LBM1-38706
Minor Fixed a rare race between a snapshot deletion and a migration setup on the target node, which left a stale snapshot in the target node GFTL. This could cause a failure in a future rebuild/migration. LBM1-42584
Minor Fixed an issue where the system could reject adding a healthy NVMe device if the request was invoked while the server was rebuilding data after a different NVMe device had previously failed. LBM1-43410
Minor Fixed an incorrect "Node inactive: Connectivity Issue" event after a server disable operation. A different event is now issued to indicate that the node is inactive due to server disable. LBM1-40182
Minor Fixed journal device events that were missing a node field. LBM1-42966
Minor Fixed unit test. LBM1-42490
Minor Fixed the systemd collector so that its metrics appear in the Lightbits documentation, and reduced the amount of logs the collector produces. LBM1-42172
Minor Increased accuracy of the alert calculation logic and improved alert message firing for NodeRebuildNotPossible, to ensure that the alert triggers as expected. LBM1-41095
Minor Replaced the deprecated gcr.io Docker registry with quay.io to ensure uninterrupted functionality. LBM1-43353
Minor Fixed an issue where a network disconnect coinciding exactly with a state change of an NVMe SSD device could prevent correct updates of future state changes of that device. LBM1-42991
Minor Updated api to v1. LBM1-40420
Minor When a journaling NVMe device fails (e.g., is removed from an MDADM RAID0 array), and the node is later rebooted, the duroslight service fails to start because the device is still marked as failed in etcd. Previously, there was no way to recover from this state without manual etcd manipulation. The new ssd-recovery CLI tool allows resetting a failed journal device back to a healthy state. It validates that the device exists, is a journal device, is in a failed state, and has a failure timestamp before making any changes. LBM1-42487
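
The aggregation behind the two NVMe error-log metrics (LBM1-42478) can be illustrated with a small sketch. This is not the exporter's code. The function name aggregate_error_log, the dict-based entry layout, and the treatment of an entry with error_count 0 as "empty" are assumptions made for illustration. Only the grouping key (device, sct, sc, dnr), the sum/maximum semantics, and the dnr_only default are taken from the entry above.

```python
from collections import defaultdict


def aggregate_error_log(entries, dnr_only=True):
    """Group log entries by (device, sct, sc, dnr).

    Returns {key: (entries_total, last_entry_count)}, where entries_total is
    the sum of error_count over the group and last_entry_count is the
    maximum error_count in the group.
    """
    totals = defaultdict(int)
    maxima = defaultdict(int)
    for entry in entries:
        count = entry.get("error_count", 0)
        if count == 0:
            continue  # assumption: an entry with error_count 0 is "empty"
        if dnr_only and not entry["dnr"]:
            continue  # default behaviour: keep only Do Not Retry entries
        key = (entry["device"], entry["sct"], entry["sc"], entry["dnr"])
        totals[key] += count
        maxima[key] = max(maxima[key], count)
    return {key: (totals[key], maxima[key]) for key in totals}


if __name__ == "__main__":
    # Hypothetical sample data, not real nvme error-log output.
    sample = [
        {"device": "nvme0", "sct": 2, "sc": 129, "dnr": True, "error_count": 3},
        {"device": "nvme0", "sct": 2, "sc": 129, "dnr": True, "error_count": 5},
        {"device": "nvme0", "sct": 0, "sc": 6, "dnr": False, "error_count": 2},
    ]
    print(aggregate_error_log(sample))
    print(aggregate_error_log(sample, dnr_only=False))
```

With dnr_only left at its default, the retryable entry in the sample is dropped. Passing dnr_only=False keeps it under its own key, which mirrors how the dnr label lets consumers filter later.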

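The dynamic SmartTemperatureWarning threshold (LBM1-42479) can be sketched in the same spirit. The snippet below is an illustration under stated assumptions, not the actual alert rule: the function names are hypothetical, and a device that does not report a threshold is modelled as None. The 5-degree margin and the 70°C fallback come from the entry above. The notes do not say whether the margin also applies to SmartTemperatureCritical, so only its 80°C fallback is recorded as a constant.

```python
from typing import Optional

WARNING_MARGIN_C = 5.0      # "threshold - 5" from the release note
WARNING_FALLBACK_C = 70.0   # used when the device reports no warning threshold
CRITICAL_FALLBACK_C = 80.0  # fallback stated for SmartTemperatureCritical


def warning_limit(reported_threshold_c: Optional[float]) -> float:
    """Return the temperature above which the warning would fire."""
    if reported_threshold_c is None:
        # Assumption: None models a device that does not report a threshold.
        return WARNING_FALLBACK_C
    return reported_threshold_c - WARNING_MARGIN_C


def temperature_warning_fires(temp_c: float,
                              reported_threshold_c: Optional[float]) -> bool:
    """True when the temperature exceeds the effective warning limit."""
    return temp_c > warning_limit(reported_threshold_c)


if __name__ == "__main__":
    print(temperature_warning_fires(81.0, 85.0))  # device reports 85°C
    print(temperature_warning_fires(71.0, None))  # no threshold reported
```
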
Installation and Upgradeability

You can upgrade to this release from all previous Lightbits 3.16.x, 3.17.x, and 3.18.x releases.
