Lightbits Cluster Architecture

The Lightbits cluster storage solution distributes services and replicates data across different Lightbits servers to keep both the service and the data available when one or more Lightbits servers experience transient or permanent failures. The cluster replicates data internally and keeps it fully consistent and available in the presence of failures. From the perspective of clients accessing the data, replication is transparent and server failover is seamless.

Lightbits protects the storage cluster from instance-related failures: when a node becomes inaccessible, or when a maintenance notification is received from AWS, the instance is replaced with a new one and its data is automatically restored. The following sections describe the failure domain and volume components used in the Lightbits cluster architecture.

For more information about Lightbits cluster architecture, see the Deploying Reliable High-Performance Storage with Lightbits Whitepaper.

Volume Assignments

As described above, the Lightbits storage cluster relies on node replication for data availability and durability. Lightbits supports Replication Factors 1, 2, and 3, where Replication Factor 3 (RF3) means that a volume is replicated on three separate storage nodes and Replication Factor 1 (RF1) means that it is stored on a single node with no replication.

For Lightbits SDS in AWS, it is recommended to create all volumes with RF3. With RF3, one of the storage nodes holding the volume acts as the primary (P) node for that volume, and the other two storage nodes, each holding a replica of the volume, act as secondary (S) nodes.

A storage node that stores data for multiple volumes can act as the primary node for some volumes and as a secondary node for others. The primary node for a given volume appears in the accessible path of the clients using that volume, handles all user IO requests for that volume, and replicates the data to the secondary nodes. If the primary node fails, the NVMe/TCP multipath feature updates the accessible path and promotes one of the secondary nodes to be the new primary node.
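As a simplified illustration of this role split, the following Python sketch models a volume with one primary and two secondary nodes and shows a secondary being promoted when the primary fails. The Volume class, the node names, and the promotion logic are all hypothetical; the real behavior is implemented inside the Lightbits cluster and the NVMe/TCP multipath stack.

    # Illustrative sketch only: models how a volume's primary/secondary roles
    # shift on node failure. Names and logic are hypothetical, not the
    # Lightbits implementation.
    from dataclasses import dataclass, field

    @dataclass
    class Volume:
        name: str
        primary: str                                      # node currently serving client IO
        secondaries: list = field(default_factory=list)   # nodes holding replicas

        def fail_node(self, node: str) -> None:
            """Simulate a node failure and promote a secondary if the primary failed."""
            if node == self.primary:
                if not self.secondaries:
                    raise RuntimeError("no healthy replica left to promote")
                # A secondary becomes the new primary; clients reach it through
                # the updated NVMe/TCP multipath accessible path.
                self.primary = self.secondaries.pop(0)
            elif node in self.secondaries:
                self.secondaries.remove(node)

    vol = Volume("vol1", primary="node-a", secondaries=["node-b", "node-c"])
    vol.fail_node("node-a")
    print(vol.primary, vol.secondaries)   # node-b ['node-c']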

When a user creates a volume, Lightbits transparently selects the nodes that will hold the volume's data and assigns the primary and secondary roles. The node selection logic balances volumes across the nodes at volume creation time.
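As a rough sketch of what capacity-balanced placement looks like, the example below picks the three least-utilized nodes for a new RF3 volume and makes the emptiest of them the primary. The heuristic, node names, and usage numbers are invented for illustration; the actual selection logic is internal to Lightbits and may take additional factors into account.

    # Hypothetical placement heuristic: choose the least-utilized nodes for a
    # new volume, then make the emptiest of them the primary. This illustrates
    # balanced placement; it is not the Lightbits algorithm.
    def place_volume(node_used_bytes: dict, replication_factor: int = 3):
        if len(node_used_bytes) < replication_factor:
            raise ValueError("not enough nodes for the requested replication factor")
        # Sort nodes by current usage and take the emptiest ones.
        candidates = sorted(node_used_bytes, key=node_used_bytes.get)
        chosen = candidates[:replication_factor]
        return {"primary": chosen[0], "secondaries": chosen[1:]}

    usage = {"node-a": 700, "node-b": 200, "node-c": 450, "node-d": 300}
    print(place_volume(usage))
    # {'primary': 'node-b', 'secondaries': ['node-d', 'node-c']}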

Dynamic Rebalancing

With dynamic rebalancing, the cluster moves volume replicas from one node to other nodes. Dynamic rebalancing takes two forms, proactive rebalance and fail in place, which are described below.

Fail in Place

When fail in place mode is activated, the Lightbits cluster attempts to move volume replicas from failed nodes to other healthy nodes while preserving failure domain requirements. In Lightbits SDS in AWS, fail in place is triggered by the Auto Maintenance process.
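The sketch below illustrates the failure domain constraint that fail in place has to respect: a replica from a failed node can only be rebuilt on a healthy node whose failure domain is not already occupied by one of the volume's surviving replicas. The function, node names, and domain labels are hypothetical and do not represent the Lightbits implementation.

    # Hypothetical sketch of the fail-in-place constraint: a replica from a
    # failed node may only be rebuilt on a healthy node whose failure domain
    # is not already used by the volume's surviving replicas.
    def pick_rebuild_target(surviving_replicas, healthy_nodes, node_domain):
        used_domains = {node_domain[n] for n in surviving_replicas}
        for node in healthy_nodes:
            if node not in surviving_replicas and node_domain[node] not in used_domains:
                return node
        return None  # no node satisfies the failure-domain requirement

    node_domain = {"node-a": "az-1", "node-b": "az-2", "node-d": "az-2", "node-e": "az-3"}
    # The volume's surviving replicas sit on node-a (az-1) and node-b (az-2).
    healthy = ["node-a", "node-b", "node-d", "node-e"]
    print(pick_rebuild_target(["node-a", "node-b"], healthy, node_domain))
    # node-e: node-d is skipped because it shares az-2 with node-b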

Proactive Rebalance

The proactive rebalance feature enables the cluster to automatically balance volumes between nodes based on capacity. When proactive rebalance is enabled, the cluster redistributes volume replicas to even out capacity usage. This prevents scenarios where one storage node runs out of capacity and drops into read-only status while other nodes still have space available to serve more capacity.
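A minimal sketch of a capacity-driven rebalance decision is shown below: when one node's utilization crosses a high watermark while another node has spare capacity, a replica migration from the fullest node to the emptiest node is proposed. The watermark value, node names, and decision logic are assumptions made for illustration; the actual proactive rebalance policy is internal to the Lightbits cluster.

    # Hypothetical illustration of a capacity-driven rebalance trigger. The
    # threshold and logic are invented for explanation only.
    def propose_rebalance(utilization: dict, high_watermark: float = 0.8):
        donors = [n for n, u in utilization.items() if u > high_watermark]
        if not donors:
            return None
        donor = max(donors, key=utilization.get)           # most loaded node
        receiver = min(utilization, key=utilization.get)   # least loaded node
        if utilization[receiver] >= high_watermark:
            return None  # nowhere to move capacity
        return {"move_replicas_from": donor, "to": receiver}

    print(propose_rebalance({"node-a": 0.91, "node-b": 0.40, "node-c": 0.62}))
    # {'move_replicas_from': 'node-a', 'to': 'node-b'}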
