Lightbits System Health Monitoring

Lightbits supports interfacing with the Prometheus open-source systems monitoring and alerting toolkit.

Metrics and Statistics

Lightbits supports the following groups of collectors, which report node and cluster metrics and statistics on the Prometheus dashboard.

By default, the following collector groups are enabled.

| Name | Scope | Description |
| --- | --- | --- |
| Total user IOPS (read, write) | cluster | User read / user write IOPS (not including replications). |
| Total user throughput (read, write) | cluster | User read / user write throughput (Bps) (not including replications). |
| Cluster Installed Physical Storage | cluster | Total capacity of all installed SSDs across all servers in the cluster, given in bytes. The sum includes inactive nodes. |
| Cluster Managed Physical Storage | cluster | Total capacity of all managed and healthy SSDs, given in bytes. The sum includes inactive nodes. |
| Cluster Effective Physical Storage | cluster | Effective physical storage, excluding the overhead of EC and OVP, given in bytes. The sum includes inactive nodes. |
| Cluster Logical Storage | cluster | Sum of the capacities of all allocated volumes, given in bytes. The sum includes inactive nodes. |
| Cluster Free Physical Storage | cluster | Free storage before entering read-only mode. The sum includes inactive nodes. |
| Cluster Estimated Free Logical Storage | cluster | Estimated free storage before entering read-only mode, assuming the current compression ratio. The sum includes inactive nodes. |
| Cluster Estimated Effective Logical Storage | cluster | Estimated storage up to the RO threshold from the client's point of view (considering compression). The sum includes inactive nodes. |
| Cluster Physical Used Storage Including Parity | cluster | Sum of the physical storage used on each of the nodes. The sum includes inactive nodes. |
| Cluster Physical Used Storage | cluster | Sum of the physical storage used on each of the nodes, excluding parity (sum of the 1.x nodes physical metric; this does not account for EC overhead). The sum includes inactive nodes. |
| Cluster Health | cluster | OK (no admin action required): all volumes are fully protected, and there is no inactive node and no node in read-only. Warning (admin action required, but there is no loss of service): at least one volume is degraded and no volume is read-only, or there is an inactive or read-only node. Error (admin action required, there is loss of service): there is a volume without working replicas or in a read-only state. |
| Number of active nodes | cluster | |
| Number of failed nodes | cluster | |
| Number of volumes | cluster | |
| Number of degraded volumes | cluster | |
| Number of volumes in read-only | cluster | |
| Number of volumes not available | cluster | |
| Cluster Logical Used Storage | cluster | Actual user objects saved (should be equal to the number of valid LBAs * 4K block size). |
| Compression ratio | cluster | |
| IOPS (read, write) (user/replication) | node | User read / user write / replication tx / replication rx IOPS. |
| Throughput (read, write) (user/replication) | node | User read / user write / replication tx / replication rx throughput (Bps). |
| Latency avg | node | |
| Average IO Size write/read | node | |
| Node Managed Physical Storage | node | Total capacity of all managed and healthy SSDs, given in bytes. |
| Node Logical Storage | node | Sum of the capacities of all allocated volumes, given in bytes. |
| Node Estimated Effective Logical Storage | node | Estimated storage up to the RO threshold from the client's point of view (considering compression). |
| Node Effective Physical Storage | node | Effective physical storage, excluding the overhead of EC and OVP, given in bytes. |
| Node Free Physical Storage | node | |
| Node Estimated Free Logical Storage | node | Assuming current compression. |
| Node Physical Used Storage Including Parity | node | Physical storage used by a node (including EC). Return value before taking into account any replication factor for unused Report Capacity available after internal Lightbits OP needs and up to RO threshold. Does not assume compression; may be limited by DRAM. |
| Node Physical Used Storage | node | Sum of the used compressed storage of all valid LBAs in the node, excluding parity. |
| Node Logical Used Storage | node | Sum of all LBAs written by the user. |
| Write amplification | node | |
| GC skips | node | |
| State | volume | Healthy / degraded / read-only / no-service. |
| Clustering rebuild progress | volume | |
| IOPS (read, write) | volume, node | User read / user write / replication tx / replication rx IOPS. |
| Throughput (read, write) | volume, node | User read / user write / replication tx / replication rx throughput (Bps). |
| Average IO Size write/read | volume, node | |
| Volume Logical Used Storage | volume, node | Logical storage space used by the volume, given in bytes. |
| Volume Physical Used Storage | volume, node | Physical storage space used by the volume, given in bytes, excluding parity. |
| Compression ratio | volume, node | |
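
The Cluster Health description above can be read as a small decision procedure over the cluster-scope counters. The following Python sketch is one interpretation of that description, not Lightbits code: the function name `cluster_health` and its parameters are hypothetical, and how the counter values are obtained is left out.

```python
def cluster_health(degraded_volumes: int,
                   read_only_volumes: int,
                   unavailable_volumes: int,
                   inactive_nodes: int,
                   read_only_nodes: int) -> str:
    """Classify cluster health as 'OK', 'warning', or 'error'.

    Hypothetical helper: the inputs are assumed to mirror the
    cluster-scope counters in the table, not a documented API.
    """
    # Error: admin action required and there is loss of service.
    # A volume has no working replicas or is in a read-only state.
    if unavailable_volumes > 0 or read_only_volumes > 0:
        return "error"
    # Warning: admin action required, but service continues.
    # A volume is degraded, or a node is inactive or read-only.
    if degraded_volumes > 0 or inactive_nodes > 0 or read_only_nodes > 0:
        return "warning"
    # OK: all volumes fully protected, no inactive or read-only node.
    return "OK"


print(cluster_health(0, 0, 0, 0, 0))  # OK
print(cluster_health(2, 0, 0, 1, 0))  # warning
print(cluster_health(0, 1, 0, 0, 0))  # error
```

The error check runs first so that a cluster with both a degraded volume and a read-only volume is reported at the more severe level.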

Alerts

You can define alert rules in the Prometheus expression language and use them to send notifications to an external service. The Lightbits status alerts are listed below.

| Name | Description |
| --- | --- |
| NodeAlmostReadOnlyMode | The node’s Lightbits system is approaching read-only mode. |
| RebuildInProgress | A node has begun a data rebuild process on a block device. |
| NodeReadOnlyMode | A node entered read-only mode. A node is in read-only mode after a number of SSDs have been identified as failed. |
| NodeRebuildNotPossible | A node cannot rebuild data after a number of SSDs have been identified as failed. |
| NodePowerUpAfterAbruptShutdown | A node is powering up after an abrupt shutdown. |
| NodeBecameInactive | A node became inactive (switched from active to any other state). |
| NodeBecameActive | A node became active (switched to active from any state). |
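
As an illustration of handing these alerts to an external service, the Python sketch below filters a notification body for firing Lightbits alerts. It assumes the alerts are forwarded by Prometheus Alertmanager through its webhook receiver, whose JSON body carries an `alerts` list with `status` and `labels` fields, and that each alert name above appears as the `alertname` label. The function name `firing_lightbits_alerts` and the example body are hypothetical.

```python
import json

# Alert names taken from the list above.
LIGHTBITS_ALERTS = {
    "NodeAlmostReadOnlyMode",
    "RebuildInProgress",
    "NodeReadOnlyMode",
    "NodeRebuildNotPossible",
    "NodePowerUpAfterAbruptShutdown",
    "NodeBecameInactive",
    "NodeBecameActive",
}


def firing_lightbits_alerts(payload: str) -> list[str]:
    """Return the names of firing Lightbits alerts in a webhook body.

    Assumes the Alertmanager webhook JSON layout: a top-level
    "alerts" list whose items carry "status" and "labels".
    """
    body = json.loads(payload)
    names = []
    for alert in body.get("alerts", []):
        name = alert.get("labels", {}).get("alertname")
        if alert.get("status") == "firing" and name in LIGHTBITS_ALERTS:
            names.append(name)
    return names


# Example body, trimmed to the fields used here (illustrative data).
example = json.dumps({
    "alerts": [
        {"status": "firing", "labels": {"alertname": "NodeReadOnlyMode"}},
        {"status": "resolved", "labels": {"alertname": "RebuildInProgress"}},
        {"status": "firing", "labels": {"alertname": "SomeOtherAlert"}},
    ]
})
print(firing_lightbits_alerts(example))  # ['NodeReadOnlyMode']
```

Resolved alerts and alerts with other names are skipped, so a downstream service only sees Lightbits conditions that are still active.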