Auto Healing Overview
Auto Healing is a feature of Lightbits’ SDS in the cloud. It keeps the cluster healthy by handling both maintenance scheduled by AWS and abrupt health issues (servers crashing or shutting down, and major OS or application failures). The feature relies on AWS notifications and health checks, as well as Lightbits’ node status and Lightbits events.
Graceful healing is triggered when an AWS SNS notification is received via AWS EventBridge for a specific server or servers in the cluster. Lightbits’ maintenance serverless function (a Lambda running in AWS) receives these messages. If there is a maintenance notification for one or more servers in the cluster, the maintenance Lambda enters the graceful-healing role and the affected server is replaced with a new instance. Note that the SNS notification can arrive moments before or after an instance is shut down, or weeks ahead of time in the case of planned maintenance.
Abrupt healing covers the scenario where the AWS hardware is OK but there is still a problem with the Lightbits node on the server (for example, a Lightbits process that stopped responding or a kernel panic), as well as the case where an instance shut down abruptly but for some reason no SNS notification was received.
Lightbits has a built-in distributed monitoring and event reporting system. Once the cluster identifies, via the internal heartbeat between its nodes, that a node is not responding, the node is marked as failed. If it stays failed for the permanent failure threshold (one hour by default; this can be reduced), it is marked as permanently failed. When a permanent failure occurs, the maintenance serverless function enters the abrupt-healing role and runs the abrupt healing flow.
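The failed and permanently failed states described above can be illustrated with a small sketch. This is not Lightbits code: the `NodeState` enum, the `classify_node` function, and the choice to treat exactly one hour as already permanent are assumptions made for illustration. Only the one-hour default and the fact that the threshold can be reduced come from the text.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class NodeState(Enum):
    # Hypothetical state names, chosen for this sketch only.
    ACTIVE = "active"
    FAILED = "failed"
    PERMANENTLY_FAILED = "permanently_failed"


# Default taken from the text: a failed node becomes a permanent failure after one hour.
PERMANENT_FAILURE_THRESHOLD = timedelta(hours=1)


def classify_node(
    failed_since: Optional[datetime],
    now: datetime,
    threshold: timedelta = PERMANENT_FAILURE_THRESHOLD,
) -> NodeState:
    """Classify a node from the time its heartbeat was first missed.

    failed_since is None while the node is still responding to heartbeats.
    """
    if failed_since is None:
        return NodeState.ACTIVE
    # Assumption: reaching the threshold exactly already counts as permanent.
    if now - failed_since >= threshold:
        return NodeState.PERMANENTLY_FAILED
    return NodeState.FAILED
```

Passing a smaller `threshold` models a cluster where the permanent failure threshold has been reduced, so abrupt healing would start sooner.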
Note that the main difference between abrupt healing and graceful healing is timing: in abrupt healing the system waits until the cluster declares the node “permanently failed”, whereas in graceful healing the replacement starts immediately when a notification is received.
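The distinction between the two roles can be sketched as a simple decision function. The function name and its boolean inputs are hypothetical and do not reflect the actual maintenance function's interface. The role names are the ones used in this document. Giving the notification priority when both conditions hold is an assumption.

```python
from typing import Optional


def select_healing_role(
    has_maintenance_notification: bool,
    node_permanently_failed: bool,
) -> Optional[str]:
    """Return the healing role to enter, or None if no healing is needed."""
    # Graceful healing starts as soon as a notification exists;
    # it does not wait for the cluster to declare a permanent failure.
    if has_maintenance_notification:
        return "graceful-healing"
    # Abrupt healing only starts once the cluster has marked the node
    # as permanently failed.
    if node_permanently_failed:
        return "abrupt-healing"
    return None
```

For example, a node that has stopped responding but has not yet reached the permanent failure threshold, and has no notification, yields `None`: nothing is replaced yet.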
Note that while the cluster is running with a reduced number of nodes, it is in a warning state and all the volumes that were on the node being replaced are in degraded mode.
Configuration
There are no configurations specific to auto healing.
Process Overview
- A node/server fails (e.g., shutdown or kernel panic), or an SNS notification is received.
- The cluster's heartbeat detects a node in permanent failure. If an SNS notification was received, healing starts immediately and does not wait for the cluster to detect a permanent failure.
- The cluster ASG instance count is temporarily increased (scaled out) to provide a replacement instance.
- The new node is configured.
- The Lightbits storage cluster starts a fail-in-place process (node rebuild) and migrates all the volumes and data of the failed node from the other replicas to the new node.
- Once the node rebuild process is complete, the failed node's instance is terminated if it is still running.
- The cluster ASG instance count is decreased back to its original size.
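The steps above can be sketched as a small in-memory simulation. It makes no AWS or Lightbits calls. `SimulatedCluster`, `replace_node`, and the step labels are hypothetical names used only to show the order of operations and how the ASG instance count changes.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SimulatedCluster:
    # Stand-in for the ASG desired instance count and the running instances.
    desired_capacity: int
    running: Set[str] = field(default_factory=set)
    log: List[str] = field(default_factory=list)


def replace_node(cluster: SimulatedCluster, failed_id: str, new_id: str) -> List[str]:
    """Walk through the replacement steps for one failed node."""
    original = cluster.desired_capacity

    # Temporarily scale out to make room for the replacement instance.
    cluster.desired_capacity = original + 1
    cluster.log.append("scale_out")

    # The new instance joins and is configured as a node.
    cluster.running.add(new_id)
    cluster.log.append("configure_new_node")

    # Node rebuild: data is migrated from the other replicas to the new node.
    cluster.log.append("rebuild")

    # Terminate the failed instance only if it is still running.
    if failed_id in cluster.running:
        cluster.running.discard(failed_id)
        cluster.log.append("terminate_failed_instance")

    # Return the instance count to its original size.
    cluster.desired_capacity = original
    cluster.log.append("scale_in")
    return cluster.log
```

In this sketch the instance count is one above its original value only between the scale-out and scale-in steps, which mirrors the temporary increase described in the list.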
Limitation
Lightbits SDS in AWS can tolerate the failure of multiple instances in the cluster, but the maintenance process recovers them one at a time.
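A minimal sketch of this one-at-a-time behavior is shown below. The function name, the `heal` callback, and the first-in, first-out ordering are assumptions. The document only states that failed instances are recovered one at a time, not in which order.

```python
from collections import deque
from typing import Callable, Iterable, List


def heal_one_at_a_time(
    failed_nodes: Iterable[str],
    heal: Callable[[str], None],
) -> List[str]:
    """Recover failed nodes strictly sequentially.

    heal is expected to return only after the node's replacement is complete,
    so the next node is not started until the previous one has finished.
    """
    pending = deque(failed_nodes)
    healed: List[str] = []
    while pending:
        node = pending.popleft()  # assumed FIFO order
        heal(node)
        healed.append(node)
    return healed
```

With several nodes failed at once, total recovery time in this model is roughly the sum of the individual rebuild times, since no two replacements overlap.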