Auto Revive
Lightbits supports auto revive of node-level services. If a node-manager stops functioning, frontend processes halt, or a kernel bug is detected, the server is rebooted and Lightbits services are restarted.
By default, the auto revive feature attempts up to two revives within a two-hour window. Both the number of attempts and the time window can be modified using the update cluster config API.
To operate correctly, auto revive needs to generate and store some local files. By default, these are placed at /var/cache/node-manager. However, they can be located at any path on the server by setting the following key in the node-manager yaml:
nodeManagerAutoReviveDir: /var/cache/node-manager
After updating the file, restart the service.
clusterconfig
AllowedNumRevives: This is the number of attempts to revive services in a specified time window (the default is set to 2; a 0 value will disable the feature).
RevivesWindowDuration: This is the time window to monitor the number of auto revive attempts (the default is set to two hours).
lbcli -J $JWT update cluster-config --parameter=AllowedNumRevives --value=3
lbcli -J $JWT update cluster-config --parameter=RevivesWindowDuration --value=3h
# disable auto revive
lbcli -J $JWT update cluster-config --parameter=AllowedNumRevives --value=0
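To illustrate how AllowedNumRevives and RevivesWindowDuration interact, the following is a minimal sketch of the limit check. It is not Lightbits code: the function name revive_allowed and its arguments are hypothetical, and it assumes a sliding window measured in seconds, which may differ from how the product actually counts attempts.

```shell
#!/usr/bin/env bash
# Illustrative only: not Lightbits code. Assumes a sliding window in seconds.
# Usage: revive_allowed ALLOWED_NUM WINDOW_SECONDS NOW_EPOCH [PAST_REVIVE_EPOCH...]
# Returns 0 (success) if one more revive would be permitted, 1 otherwise.
revive_allowed() {
  local allowed=$1 window=$2 now=$3
  shift 3

  # AllowedNumRevives=0 disables auto revive, so nothing is ever permitted.
  if [ "$allowed" -eq 0 ]; then
    return 1
  fi

  # Count past revives that fall inside the window ending at "now".
  local recent=0 ts
  for ts in "$@"; do
    if [ $(( now - ts )) -lt "$window" ]; then
      recent=$(( recent + 1 ))
    fi
  done

  # Permit another revive only while the recent count is below the limit.
  [ "$recent" -lt "$allowed" ]
}

# Example: default settings (2 revives per 2 hours = 7200 s),
# with one earlier revive that happened 30 minutes ago.
now=$(date +%s)
if revive_allowed 2 7200 "$now" $(( now - 1800 )); then
  echo "revive permitted"
else
  echo "revive limit reached"
fi
```

Under these assumptions, raising AllowedNumRevives or shortening RevivesWindowDuration makes the check more permissive, while setting AllowedNumRevives to 0 rejects every attempt.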