Lightbits Server Maintenance and Handling

This article details various maintenance-related tips, best practices, and other information for effectively handling Lightbits servers.

This article is for advanced users. It requires a high level of knowledge of Lightbits server maintenance. For any additional questions, contact Lightbits Support.

Server Management Components and Responsibilities

This section discusses the management components and responsibilities of the Lightbits server.

api-service

  • The ears of the system. Handles API (REST/gRPC) requests.
  • active-active, runs on all servers.
  • Asynchronous by design.

cluster-manager

  • The brain of the system. Handles cluster-wide operations (e.g., placing volumes, deciding whether to set nodes active/inactive, triggering dynamic rebalancing).
  • active-passive, runs on all servers. Only one instance (“the leader”) is active.

node-manager

  • The arms of the system. Responsible for orchestrating the local services and resources, based on Cluster-Manager requests.
  • One instance per server.

lightbox-exporter

  • The mouth of the server. Exports the system metrics to Prometheus (also used for alerts/alarms).
  • One instance per server.

discovery-service

  • The eyes of the server. Keeps clients informed on how to connect to the cluster, and updates them when there are changes (for example, servers being added/removed from the cluster).
  • active-active, runs on all servers.

upgrade-manager

  • Cluster-wide mini service: active-passive, runs on all servers. Only one instance ("the leader") is active.
  • Orchestrates and runs Lightbits upgrades.
  • A local mini service on each server updates the local services.
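As a quick health check, the management services above can be polled with systemctl. This is a minimal sketch, not an official tool: it assumes the systemd unit names match the service names used in this article, and the hypothetical CHECK_CMD override exists only so the loop can be exercised without systemd.

```shell
# Minimal health check for the Lightbits management services.
# Assumption: systemd unit names match the service names in this article.
# CHECK_CMD is a hypothetical override for exercising the loop without systemd.
SERVICES="api-service cluster-manager node-manager lightbox-exporter discovery-service upgrade-manager"

service_health() {
  for svc in $SERVICES; do
    state=$(${CHECK_CMD:-systemctl is-active} "$svc" 2>/dev/null)
    [ -n "$state" ] || state=unknown
    printf '%-20s %s\n' "$svc" "$state"
  done
}

service_health
```

Remember that on a healthy server only one cluster-manager instance across the whole cluster is the active leader, so "active" here means the systemd unit is running, not that this server is the leader.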

Lightbits Services Startup Workflow

Lightbits services start after network.target and networking.service are up.

  1. profile-generator (PG): The service starts and exits (oneshot). It inspects the hardware and generates configuration files for NM, DL, and GFTL. File: /etc/profile-generator/system-profile.yaml.
  2. etcd: Third-party clustered key-value database; starts up, connects with its peers (they elect a leader), and begins operating.
  3. Starts in parallel after PG and etcd: api-service, discovery-service, upgrade-manager, and cluster-manager (CM). CM is the decision maker at the cluster level.
  4. node-manager (NM): Reads etcd and the configurations generated by profile-generator to configure and start GFTL and then Duroslight.
  • GFTL (Global Flash Translation Layer): Takes all NVMes and presents them as one logical layer. It also performs abrupt and graceful recoveries. Abrupt: scans through all Flash to reconstruct the state (note that this can take many hours). Graceful: starts quickly, because the state was already saved on the NVMes by a previous graceful shutdown.
  • duroslight service[s]: Started as duroslight-0; on dual-NUMA servers, duroslight-1 also starts. Uses the NVMe over TCP protocol. All requests are sharded: read and write requests are broken up into 128K chunks, and for any given volume the same CPU handles a given chunk. Requests from all clients are balanced across CPUs.
  • lightbox-exporter service: Generates metrics when Prometheus polls.
  • GFTL and Duroslight are not "enabled" (they do not start on boot). NM starts them.
  • States: Inactive → GFTL → Duroslight → Activating (connection to DL peers) → Active → volume rebuilds.
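The powerup state sequence above can be expressed as a simple lookup. This is purely an illustration of the documented order (state names are taken from this article), not a Lightbits API:

```shell
# Powerup state sequence: Inactive -> GFTL -> Duroslight -> Activating -> Active.
# Illustrative only; state names are taken from this article.
next_state() {
  case "$1" in
    Inactive)   echo GFTL ;;
    GFTL)       echo Duroslight ;;
    Duroslight) echo Activating ;;
    Activating) echo Active ;;
    Active)     echo "Active (volume rebuilds may follow)" ;;
    *)          echo unknown ;;
  esac
}

next_state Duroslight
```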

Lightbits Services Shutdown Sequence

During an abrupt shutdown, the system does not save the state. On boot, GFTL performs an abrupt recovery.

During a graceful shutdown:

  • Duroslight is shut down first: it disconnects clients and makes sure that any writes in progress are complete.
  • GFTL saves state information into etcd and the Flash array. Sync points for rebuilding are saved to etcd; other state information is saved to the Flash array.
  • NM shuts down.
  • Remaining services are shut down.

Startup/Shutdown Order

The table below details the startup and shutdown order for the Lightbits Services server.

  Lightbits Services Shutdown Order | Lightbits Services Startup Order
  --------------------------------- | --------------------------------
  1. api-service                    | 1. profile-generator
  2. profile-generator              | 2. etcd
  3. node-manager                   | 3. cluster-manager
  4. cluster-manager                | 4. node-manager
  5. discovery-service              | 5. discovery-service
  6. lightbox-exporter              | 6. lightbox-exporter
  7. upgrade-manager                | 7. upgrade-manager
  8. etcd                           | 8. api-service
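The shutdown order can be scripted. The sketch below is not an official tool: it assumes the systemd unit names match the service names in the table, and by default it only prints the commands (set DRY_RUN=0 to execute them, and only after consulting Lightbits Support):

```shell
# Stop Lightbits services in the documented shutdown order.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
# Assumption: systemd unit names match the service names in the table.
SHUTDOWN_ORDER="api-service profile-generator node-manager cluster-manager discovery-service lightbox-exporter upgrade-manager etcd"

stop_lightbits() {
  for svc in $SHUTDOWN_ORDER; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "systemctl stop $svc"
    else
      systemctl stop "$svc"
    fi
  done
}

stop_lightbits
```

Note that Duroslight and GFTL do not appear in the loop: NM shuts them down as part of its own graceful stop, and they must never be stopped directly.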

Node Initialization Types

The following are the node initialization types and their causes and results.

Abrupt shutdown and abrupt recovery:

  • Caused by force-killing NM or a forced reboot.
  • Slow powerup.

Graceful shutdown and graceful recovery:

  • Caused by graceful stop and reboot.
  • Fast powerup.
  • A graceful stop of NM takes time, because GFTL must save its state.

Starting NM returns immediately and begins the powerup procedure. It can be monitored via lbcli list nodes -o yaml or the NM log.
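Powerup progress can be followed by extracting each node's state from the lbcli output. The snippet below parses a hypothetical sample shaped like `lbcli list nodes -o yaml`; the field names (`name`, `state`) are assumptions, so check them against your actual output before relying on the filter:

```shell
# Extract a node's state from YAML output shaped like `lbcli list nodes -o yaml`.
# The sample below is hypothetical; real field names may differ.
sample_output='- name: server00
  state: Activating
- name: server01
  state: Active'

node_state() {   # usage: node_state <node-name>
  printf '%s\n' "$sample_output" | awk -v n="$1" '
    $1 == "-" && $2 == "name:" && $3 == n { found = 1; next }
    found && $1 == "state:"               { print $2; exit }'
}

node_state server00
```

Against a live cluster, replace the embedded sample by piping the real command into the same awk filter.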

Common Software Services Issues

The chart below illustrates some of the common issues with Lightbits software services, as well as remediation steps for resolving them.

Issue/Task: No output with lbcli. Possible causes:

  • Incorrect or outdated JWT.
  • etcd out of sync.

Remediation steps:

  • Check the JWT in /etc/lbcli/lbcli.yaml.
  • Check for the presence of an incorrect lbcli.yaml or lbcli.json in the $HOME or current working directory.

Issue/Task: Consistency issue with lbcli output.
Remediation steps: systemctl restart api-service

Issue/Task: A node is clearly "Inactive".
Remediation steps: systemctl stop/start node-manager

Issue/Task: Need to start/stop duroslight, gftl/lbe.
Remediation steps: systemctl stop/start node-manager

Note: Consult with Lightbits Support prior to performing this action.

Note: Never restart/stop duroslight or gftl directly.

Do not restart/stop cluster-manager without Lightbits approval. Do not restart/stop node-manager without Lightbits approval (unless the node is clearly inactive).
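When lbcli gives no output, the first thing to check is the JWT. A JWT's payload is base64url-encoded JSON, so its claims (such as the expiry) can be inspected with standard tools. The token below is a hypothetical sample built inline purely to demonstrate the decoding step; in practice you would take the token from /etc/lbcli/lbcli.yaml:

```shell
# Decode the payload (middle segment) of a JWT to inspect its claims.
# The sample token is hypothetical; in practice read it from /etc/lbcli/lbcli.yaml.
b64url() { printf '%s' "$1" | base64 | tr '+/' '-_' | tr -d '=\n'; }
sample_jwt="$(b64url '{"alg":"none"}').$(b64url '{"sub":"admin","exp":1735689600}')."

jwt_payload() {
  p=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # restore the base64 padding that JWT encoding strips
  while [ $(( ${#p} % 4 )) -ne 0 ]; do p="${p}="; done
  printf '%s' "$p" | base64 -d
}

jwt_payload "$sample_jwt"
```

If the decoded exp claim (a Unix timestamp) is in the past, the token is outdated and lbcli will not work with it.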

Services and Failure Results

The following chart details failure results and consequences of crashes/failures/stops or restarts on a Lightbits server.

  Service           | Failure Result       | Recovery    | Recovery Time     | Restart Result | Restart Time
  ----------------- | -------------------- | ----------- | ----------------- | -------------- | ------------------
  etcd              | Node inactive.       |             |                   |                |
  cluster-manager   |                      |             |                   |                |
  api-service       | No API access.       | Manual      | 0                 |                |
  upgrade-manager   | Upgrade failed.      | Manual      | 0                 |                |
  discovery-service |                      |             |                   |                |
  profile-generator | N/A                  |             |                   |                |
  lightbox-exporter | No metrics exported. | Manual      | 0                 |                |
  node-manager      | Node inactive.       | Auto-revive | 20 minutes (1)    | Node inactive. | 20 minutes
  duroslight        | Node inactive.       | Auto-revive | 20 minutes (1)    | Node inactive. | 20 minutes (1)
  gftl              | Node inactive.       | Auto-revive | Up to 8 hours (2) | Node inactive. | Up to 8 hours (2)