Lightbits System Health Monitoring
Lightbits supports interfacing with the Prometheus open-source systems monitoring and alerting toolkit.
Metrics and Statistics
The following groups of collectors are supported by Lightbits to report node and cluster metrics and statistics on the Prometheus dashboard.
By default, the following collector groups are enabled.
Name | Scope | Description |
---|---|---|
Total user IOPS (read,write) | cluster | IOps user read / user write (not including replications). |
Total user throughput (read,write) | cluster | Throughput (Bps) user read / user write (not including replications). |
Cluster Installed Physical Storage | cluster | Total capacity of all installed SSDs across all servers in the cluster, given in bytes. The sum includes Inactive nodes. |
Cluster Managed Physical Storage | cluster | Total capacity of all managed and healthy SSDs, given in bytes. The sum includes Inactive nodes. |
Cluster Effective Physical Storage | cluster | Effective physical storage excluding overhead of EC and OVP, given in bytes. The sum includes Inactive nodes. |
Cluster Logical Storage | cluster | Sum of capacities of all allocated volumes, given in bytes. The sum includes Inactive nodes. |
Cluster Free Physical Storage | cluster | Free storage before entering read-only mode. The sum includes Inactive nodes. |
Cluster Estimated Free Logical Storage | cluster | Estimated free storage before entering read-only mode assuming the current compression ratio. The sum includes Inactive nodes. |
Cluster Estimated Effective Logical Storage | cluster | Estimated storage up to the read-only (RO) threshold from the client's point of view (considering compression). The sum includes Inactive nodes. |
Cluster Physical Used Storage Including Parity | cluster | The sum of physical storage used on each of the nodes. The sum includes Inactive nodes. |
Cluster Physical Used Storage | cluster | The sum of physical storage used on each of the nodes, excluding parity (the sum of the 1.x node physical metric; this does not account for EC overhead). The sum includes Inactive nodes. |
Cluster Health | cluster | OK (no admin action required): all volumes are fully protected, there is no Inactive node, and no node is read-only. Warning (admin action required, but there is no loss of service): at least one volume is degraded and no volume is read-only, or there is an Inactive or read-only node. Error (admin action required, there is loss of service): there is a volume without working replicas or in a read-only state. |
Number of active nodes | cluster | |
Number of failed nodes | cluster | |
Number of volumes | cluster | |
Number of degraded volumes | cluster | |
Number of volumes in read-only | cluster | |
Number of volumes not available | cluster | |
Cluster Logical Used Storage | cluster | Actual user objects saved (should equal the number of valid LBAs multiplied by the 4K block size). |
Compression ratio | cluster | |
IOPS (read, write)(user/replication) | node | IOps user read / user write / replication tx / replication rx |
Throughput (read,write)(user/replication) | node | Throughput (Bps) user read / user write / replication tx / replication rx |
Latency avg | node | |
Average IO Size write/read | node | |
Node Managed Physical Storage | node | Total capacity of all managed and healthy SSDs, given in bytes. |
Node Logical Storage | node | Sum of capacities of all allocated volumes, given in bytes. |
Node Estimated Effective Logical Storage | node | Estimated storage up to the read-only (RO) threshold from the client's point of view (considering compression). |
Node Effective Physical Storage | node | Effective physical storage excluding overhead of EC and OVP, given in bytes. |
Node Free Physical Storage | node | |
Node Estimated Free Logical Storage | node | Assuming current compression. |
Node Physical Used Storage Including Parity | node | Physical storage used by a node (including EC). Return value before taking into account any replication factor for unused Report Capacity available after internal Lightbits OP needs and up to the RO threshold. Does not assume compression; may be limited by DRAM. |
Node Physical Used Storage | node | Sum of used compressed storage of all valid LBAs in the node. Excluding Parity. |
Node Logical Used Storage | node | Sum of all written LBAs by user. |
Write amplification | node | |
gc skips | node | |
State | volume | Healthy / degraded / read-only / no-service |
Clustering rebuild progress | volume | |
IOPS (read,write) | volume, node | IOps user read / user write / replication tx / replication rx |
Throughput (read,write) | volume, node | Throughput (Bps) user read / user write / replication tx / replication rx |
Average IO Size write/read | volume, node | |
Volume Logical Used Storage | volume, node | Logical storage space used by volume, given in bytes. |
Volume Physical Used Storage | volume, node | Physical storage space used by volume, given in bytes. Excluding parity. |
Compression ratio | volume, node | |
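The Cluster Health rules in the table above can be sketched as a small decision function. This is only an illustration of how the three states relate to the cluster counters, not Lightbits code: the `ClusterSnapshot` class and its field names are hypothetical, and the error-before-warning ordering is an assumption made so that a loss of service always takes precedence.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical container for cluster-scope counters (names are assumptions)."""
    inactive_nodes: int
    read_only_nodes: int
    degraded_volumes: int
    read_only_volumes: int
    unavailable_volumes: int  # volumes without working replicas


def cluster_health(s: ClusterSnapshot) -> str:
    """Return "OK", "warning", or "error" following the Cluster Health description."""
    # Error: admin action required and there is loss of service.
    if s.unavailable_volumes > 0 or s.read_only_volumes > 0:
        return "error"
    # Warning: admin action required, but no loss of service.
    if s.degraded_volumes > 0 or s.inactive_nodes > 0 or s.read_only_nodes > 0:
        return "warning"
    # OK: all volumes fully protected, no inactive or read-only node.
    return "OK"


if __name__ == "__main__":
    # One inactive node, all volumes still served.
    print(cluster_health(ClusterSnapshot(1, 0, 0, 0, 0)))
```

The counters would normally come from the cluster-scope metrics in the table, such as the number of degraded volumes and the number of volumes in read-only.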
Alerts
You can define alert rules in the Prometheus expression language and use them to send notifications to an external service. The following table lists the Lightbits status alerts.
Name | Description |
---|---|
NodeAlmostReadOnlyMode | The node’s Lightbits system is approaching read-only mode. |
RebuildInProgress | A node has begun a data rebuild process on a block device. |
NodeReadOnlyMode | A node entered read-only mode. A node enters read-only mode after a number of SSDs have been identified as failed. |
NodeRebuildNotPossible | A node cannot rebuild data after a number of SSDs have been identified as failed. |
NodePowerUpAfterAbruptShutdown | A node is powering up after an abrupt shutdown. |
NodeBecameInactive | A node became inactive (switched from active to any other state). |
NodeBecameActive | A node became active (switched to active from any state). |
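As an illustration of the external-service side, the sketch below filters the alerts listed above out of a notification payload. It assumes the notifications arrive in the JSON format used by the Prometheus Alertmanager webhook receiver (an `alerts` array whose entries carry a `status` and a `labels.alertname`); whether a given deployment routes alerts this way is an assumption, and the function name `firing_lightbits_alerts` is hypothetical.

```python
import json
from typing import List

# Alert names taken from the table above.
LIGHTBITS_ALERTS = {
    "NodeAlmostReadOnlyMode",
    "RebuildInProgress",
    "NodeReadOnlyMode",
    "NodeRebuildNotPossible",
    "NodePowerUpAfterAbruptShutdown",
    "NodeBecameInactive",
    "NodeBecameActive",
}


def firing_lightbits_alerts(payload: str) -> List[str]:
    """Return the names of firing Lightbits alerts in a webhook-style JSON payload."""
    body = json.loads(payload)
    names = []
    for alert in body.get("alerts", []):
        name = alert.get("labels", {}).get("alertname")
        # Keep only alerts that are currently firing and appear in the table.
        if alert.get("status") == "firing" and name in LIGHTBITS_ALERTS:
            names.append(name)
    return names


if __name__ == "__main__":
    sample = json.dumps({
        "alerts": [
            {"status": "firing", "labels": {"alertname": "NodeReadOnlyMode"}},
            {"status": "resolved", "labels": {"alertname": "RebuildInProgress"}},
        ]
    })
    print(firing_lightbits_alerts(sample))
```

A receiver built this way could page an operator for alerts such as NodeRebuildNotPossible while only logging informational ones such as NodeBecameActive.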