Auto Maintenance Overview

Lightbits SDS in AWS provides several features that automate maintenance activities. A maintenance process handles these automatic operations on the storage cluster. They currently include Automatic Healing (Auto Healing) of the cluster, Automatic Scaling (Auto Scaling), and Automated Upgrade (initiated via a CloudFormation (CF) update).

Auto Healing

Because of the way the public cloud works, when an instance is shut down or fails, all data on the server's local disks is lost. The cluster must therefore respond to these scenarios, either proactively, when AWS gives advance notice of planned maintenance on a server, or reactively, when a server fails or simply shuts down. The auto healing process replaces the affected server and ensures that data is consistent across all replicas.

Auto Scaling

The Auto Scaling Group (ASG) contains a collection of EC2 instances that are treated as a logical group for the purposes of automatic scaling and management. The ASG has a minimum, maximum, and desired number of instances. When a node reaches 90% capacity, it becomes read-only and data consistency is at risk. As a preventive action, the auto scaling process increases the cluster size (adds a server) when the storage cluster exceeds a certain capacity threshold, so that the node does not become read-only. This increases the capacity of the cluster. Currently, only scale-out is supported; scale-in will be supported in an upcoming version.
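The scale-out decision described above can be illustrated with a minimal sketch. Only the 90% read-only limit comes from the text; the `NodeUsage` type, the function names, and the 80% default for `scale_out_threshold` are hypothetical and chosen for illustration, since the actual threshold value is not stated here.

```python
from dataclasses import dataclass
from typing import List

# From the text: a node becomes read-only when it reaches 90% capacity.
READ_ONLY_THRESHOLD = 0.90


@dataclass
class NodeUsage:
    """Hypothetical per-node capacity record (not a Lightbits type)."""
    name: str
    used_bytes: int
    total_bytes: int


def cluster_utilization(nodes: List[NodeUsage]) -> float:
    """Return the used fraction of the whole cluster (0.0 for an empty cluster)."""
    total = sum(n.total_bytes for n in nodes)
    if total == 0:
        return 0.0
    return sum(n.used_bytes for n in nodes) / total


def should_scale_out(nodes: List[NodeUsage], scale_out_threshold: float = 0.80) -> bool:
    """Decide whether to add a server before nodes hit the read-only limit.

    The 0.80 default is an assumption; it only has to sit below the
    read-only limit so that scaling happens as a preventive action.
    """
    if not 0.0 < scale_out_threshold < READ_ONLY_THRESHOLD:
        raise ValueError("scale-out threshold must be below the read-only limit")
    return cluster_utilization(nodes) > scale_out_threshold
```

Keeping the scale-out threshold strictly below the read-only limit is what makes the action preventive: the new server is requested while the existing nodes can still accept writes.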

Cluster Automated Upgrade

When you trigger a cluster upgrade, Lightbits detects it and initiates an automatic rolling upgrade with no downtime.

Maintenance Process

The maintenance process runs as a serverless process (a Lambda function in AWS) that is triggered periodically (currently every 60 seconds). It identifies the current ASG state (the number of instances available for the cluster), the health state of each of the cluster’s nodes, and the scaling and upgrade states of the cluster. The process then applies the appropriate operations to keep the cluster healthy, at the correct scale, and up to date.

All of the maintenance features use the ASG capabilities. For example, when a server needs to be added, this is done by increasing the minimum, maximum, and desired server counts in the ASG.
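As a rough illustration of the ASG change described above, the following sketch raises all three counts by one. It is not the Lightbits implementation: the function name `add_one_server` is hypothetical, and the `asg_client` argument is assumed to expose the `describe_auto_scaling_groups` and `update_auto_scaling_group` calls of the boto3 Auto Scaling client.

```python
def add_one_server(asg_client, asg_name: str) -> dict:
    """Grow the ASG by one instance by raising min, max, and desired together.

    `asg_client` is assumed to behave like boto3.client("autoscaling").
    """
    response = asg_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )
    groups = response["AutoScalingGroups"]
    if not groups:
        raise LookupError(f"Auto Scaling Group not found: {asg_name}")
    group = groups[0]

    # Raise all three values in one update so that the desired count
    # always stays within the [min, max] range.
    new_sizes = {
        "MinSize": group["MinSize"] + 1,
        "MaxSize": group["MaxSize"] + 1,
        "DesiredCapacity": group["DesiredCapacity"] + 1,
    }
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name, **new_sizes
    )
    return new_sizes
```

With boto3 installed and AWS credentials configured, the client could be created with `boto3.client("autoscaling")` and passed in as `asg_client`. Passing the client as a parameter also makes the function easy to exercise with a stub.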

Maintenance State Machine

| State | Description | Condition |
| --- | --- | --- |
| idle | System is normal and working properly. | All nodes are responsive and no upgrade is triggered. |
| upgrade | Upgrade is in progress. | There is an instance in the cluster that has an AMI that does not match the current AMI configured in the ASG’s launch template. |
| graceful-healing | Graceful healing is in progress. | An AWS maintenance notification has been received for any one of the instances in the ASG. |
| abrupt-healing | Failed instance is being replaced with a new instance. | An instance (node) in the ASG has failed (AWS failure or application failure), and needs to be replaced. |
| scale-out | The cluster is being scaled out (a new instance is added - not DR). | The cluster capacity has surpassed the scale-out capacity threshold. |
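The conditions in the state table can be read as a simple decision function, sketched below. The `Instance` and `ClusterSnapshot` types and their field names are hypothetical, and the order in which the conditions are checked is an assumption made for this example: the text does not say which state takes precedence when several conditions hold at once.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """Hypothetical view of one ASG instance."""
    instance_id: str
    ami_id: str
    healthy: bool = True
    maintenance_scheduled: bool = False  # AWS maintenance notification received


@dataclass
class ClusterSnapshot:
    """Hypothetical snapshot gathered on one maintenance run."""
    instances: List[Instance]
    launch_template_ami: str
    capacity_used: float        # used fraction of cluster capacity, 0.0 to 1.0
    scale_out_threshold: float  # scale-out capacity threshold, same scale


def next_state(snapshot: ClusterSnapshot) -> str:
    """Map a snapshot to one of the states listed in the table.

    ASSUMPTION: the priority order (failures first, then planned
    maintenance, then upgrade, then scale-out) is illustrative only.
    """
    if any(not i.healthy for i in snapshot.instances):
        return "abrupt-healing"
    if any(i.maintenance_scheduled for i in snapshot.instances):
        return "graceful-healing"
    if any(i.ami_id != snapshot.launch_template_ami for i in snapshot.instances):
        return "upgrade"
    if snapshot.capacity_used > snapshot.scale_out_threshold:
        return "scale-out"
    return "idle"
```

A periodic trigger would call `next_state` once per run with a fresh snapshot and then carry out the operation for the returned state.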