Lightbits System Health Monitoring

Lightbits supports interfacing with the Prometheus open-source systems monitoring and alerting toolkit.
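
As a rough illustration of what reading data back from Prometheus can look like, the sketch below uses the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and the built-in `up` metric, which Prometheus records for every scrape target. The server address is an assumption, and the helper names (`build_query_url`, `parse_instant_vector`, `query`) are hypothetical; nothing here is specific to Lightbits.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of your Prometheus server; adjust to your deployment.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_vector(payload):
    """Convert an instant-query JSON response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %s" % payload.get("error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected a vector result, got %s" % data["resultType"])
    # Each sample is {"metric": {labels}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query(base_url, promql, timeout=5.0):
    """Run an instant query against a live Prometheus server."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


# Example (requires a reachable Prometheus server):
#   for labels, value in query(PROMETHEUS_URL, "up"):
#       print(labels.get("instance"), "up" if value == 1.0 else "down")
```

The parsing step is kept separate from the network call so it can be exercised against a saved response without a running server.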

Metrics and Statistics

The following groups of collectors are supported by Lightbits to report node and cluster metrics and statistics on the Prometheus dashboard.

By default, the following collector groups are enabled.

Name	Scope	Description
Total user IOPS (read,write)	cluster	IOPS for user read / user write (not including replications).
Total user throughput (read,write)	cluster	Throughput (Bps) for user read / user write (not including replications).
Cluster Installed Physical Storage	cluster	Total capacity of all installed SSDs across all servers in the cluster, given in bytes. The sum includes Inactive nodes.
Cluster Managed Physical Storage	cluster	Total capacity of all managed and healthy SSDs, given in bytes. The sum includes Inactive nodes.
Cluster Effective Physical Storage	cluster	Effective physical storage excluding overhead of EC and OVP, given in bytes. The sum includes Inactive nodes.
Cluster Logical Storage	cluster	Sum of capacities of all allocated volumes, given in bytes. The sum includes Inactive nodes.
Cluster Free Physical Storage	cluster	Free storage before entering read-only mode. The sum includes Inactive nodes.
Cluster Estimated Free Logical Storage	cluster	Estimated free storage before entering read-only mode assuming the current compression ratio. The sum includes Inactive nodes.
Cluster Estimated Effective Logical Storage	cluster	Estimated storage up to the RO threshold from the client's point of view (considering compression). The sum includes Inactive nodes.
Cluster Physical Used Storage Including Parity	cluster	The sum of physical storage used on each of the nodes. The sum includes Inactive nodes.
Cluster Physical Used Storage	cluster	The sum of physical storage used on each of the nodes (sum on 1.x nodes physical metric, this does not account for EC overhead). Excluding Parity. The sum includes Inactive nodes.
Cluster Health	cluster	OK (no admin action required): all volumes are fully protected, there is no inactive node, and no node is in read-only. Warning (admin action required, but there is no loss of service): at least one volume is degraded and no volume is read-only, OR there is an Inactive or read-only node. Error (admin action required, there is loss of service): there is a volume without working replicas or in a read-only state.
Number of active nodes	cluster
Number of failed nodes	cluster
Number of volumes	cluster
Number of degraded volumes	cluster
Number of volumes in read-only	cluster
Number of volumes not available	cluster
Cluster Logical Used Storage	cluster	Actual User Objects saved (should be equal to the number of valid LBAs * 4K Block Size)
Compression ratio	cluster
IOPS (read, write)(user/replication)	node	IOPS for user read / user write / replication tx / replication rx
Throughput (read,write)(user/replication)	node	Throughput (Bps) for user read / user write / replication tx / replication rx
Latency avg	node
Average IO Size write/read	node
Node Managed Physical Storage	node	Total capacity of all managed and healthy SSDs, given in bytes.
Node Logical Storage	node	Sum of capacities of all allocated volumes, given in bytes.
Node Estimated Effective Logical Storage	node	Estimated storage up to the RO threshold from the client's point of view (considering compression).
Node Effective Physical Storage	node	Effective physical storage excluding overhead of EC and OVP, given in bytes.
Node Free Physical Storage	node
Node Estimated Free Logical Storage	node	Assuming current compression.
Node Physical Used Storage Including Parity	node	Physical storage used by a node (incl EC). Return value before taking into account any replication factor for unused Report Capacity available after internal Lightbits OP needs and up to RO threshold. Does not assume compression, may be limited by DRAM.
Node Physical Used Storage	node	Sum of used compressed storage of all valid LBAs in the node. Excluding Parity.
Node Logical Used Storage	node	Sum of all written LBAs by user.
Write amplification	node
gc skips	node
State	volume	Healthy / degraded / read-only / no-service
Clustering rebuild progress	volume
IOPS (read,write)	volume, node	IOPS for user read / user write / replication tx / replication rx
Throughput (read,write)	volume, node	Throughput (Bps) for user read / user write / replication tx / replication rx
Average IO Size write/read	volume, node
Volume Logical Used Storage	volume, node	Logical storage space used by volume, given in bytes.
Volume Physical Used Storage	volume, node	Physical storage space used by volume, given in bytes. Excluding parity.
Compression ratio	volume, node
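
The Cluster Health row in the table above can be read as a small decision procedure over the volume and node counters listed alongside it. The following is a minimal sketch of that reading, not the Lightbits implementation. The function and parameter names are hypothetical, and it assumes that "a volume without working replicas" corresponds to the not-available volume count.

```python
def cluster_health(degraded_volumes, read_only_volumes, unavailable_volumes,
                   inactive_nodes, read_only_nodes):
    """Classify cluster health from counters, following the table's wording.

    All parameter names are illustrative; they are not Lightbits metric names.
    """
    # Error: some volume has no working replicas (assumed to mean
    # "not available") or is in a read-only state, so there is loss of service.
    if unavailable_volumes > 0 or read_only_volumes > 0:
        return "error"
    # Warning: at least one volume is degraded (no volume is read-only,
    # which was ruled out above), or a node is inactive or read-only.
    if degraded_volumes > 0 or inactive_nodes > 0 or read_only_nodes > 0:
        return "warning"
    # OK: all volumes fully protected, no inactive or read-only node.
    return "OK"
```

Checking the error condition first means a cluster with both a degraded volume and a read-only volume is reported as an error, which matches the warning definition's requirement that no volume is read-only.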

Alerts

You can define alert rules in the Prometheus expression language and use them to send notifications to an external service. The following list details the Lightbits status alerts.

Name	Description
NodeAlmostReadOnlyMode	The node’s Lightbits system is approaching read-only mode.
RebuildInProgress	A node has begun a data rebuild process on a block device.
NodeReadOnlyMode	A node entered read-only mode. A node is in read-only mode after a number of SSDs have been identified as failed.
NodeRebuildNotPossible	A node cannot rebuild data after a number of SSDs have been identified as failed.
NodePowerUpAfterAbruptShutdown	A node is powering up after an abrupt shutdown.
NodeBecameInactive	A node became inactive (switched from active to any other state).
NodeBecameActive	A node became active (switched to active from any state).
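
To see which of the alerts above are currently firing, one option is the standard Prometheus HTTP API endpoint `/api/v1/alerts`, which lists active alerts along with their labels and state. The sketch below assumes the alert rules are loaded into Prometheus under the names in the table and that the server is reachable at the address shown; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: address of your Prometheus server; adjust to your deployment.
PROMETHEUS_URL = "http://localhost:9090"

# Alert names taken from the table above.
LIGHTBITS_ALERTS = {
    "NodeAlmostReadOnlyMode",
    "RebuildInProgress",
    "NodeReadOnlyMode",
    "NodeRebuildNotPossible",
    "NodePowerUpAfterAbruptShutdown",
    "NodeBecameInactive",
    "NodeBecameActive",
}


def firing_lightbits_alerts(payload):
    """Return the firing alerts whose alertname is in LIGHTBITS_ALERTS."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        a for a in alerts
        if a.get("state") == "firing"
        and a.get("labels", {}).get("alertname") in LIGHTBITS_ALERTS
    ]


def fetch_alerts(base_url, timeout=5.0):
    """Fetch the raw /api/v1/alerts response from a live Prometheus server."""
    url = base_url.rstrip("/") + "/api/v1/alerts"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example (requires a reachable Prometheus server):
#   for alert in firing_lightbits_alerts(fetch_alerts(PROMETHEUS_URL)):
#       print(alert["labels"]["alertname"], alert["labels"].get("instance"))
```

Alerts in the `pending` state are skipped here; drop the state check if you also want alerts that have matched their rule but not yet met its `for` duration.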
