Lightbits Cluster Architecture
The Lightbits cluster storage solution distributes services and replicates data across multiple Lightbits servers. This guarantees service and data availability when one or more Lightbits servers experience transient or permanent failures. The cluster replicates data internally and keeps it fully consistent and available in the presence of failures; from the perspective of clients accessing the data, replication is transparent and server failover is seamless.
Lightbits also protects the storage cluster from failures unrelated to the SSDs (e.g., CPU, memory, or NIC failures), as well as software failures, network failures, and rack power failures. In addition, in-server Erasure Coding (EC) protects each server from SSD failures, and the replication across servers enables non-disruptive maintenance routines that temporarily disable access to storage servers (e.g., top-of-rack switch firmware upgrades).
The following sections describe the failure domain and volume components used in the Lightbits cluster architecture.
For more information about Lightbits cluster architecture, see the Deploying Reliable High-Performance Storage with Lightbits Whitepaper.
Nodes
Each server can be split into multiple logical nodes. Each logical node owns a specific set of SSDs and CPUs, and a portion of the RAM and NVRAM. The physical network can be shared across nodes or dedicated to each node.
Logical nodes can span NUMA domains or be confined to a single NUMA domain; there is no required relationship between a logical node and the NUMA locality of the resources it uses.
Each storage server runs a single Node Manager service. The service controls all the logical nodes of the storage server.
The current Lightbits release supports up to two logical nodes per server. A deployment with a single logical node per server is commonly referred to as a "single instance", "single node", or "single NUMA" deployment; a deployment with two logical nodes per server is referred to as a "dual instance", "dual node", or "dual NUMA" deployment.
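As a purely illustrative sketch (not a Lightbits configuration format; the device names, core counts, and memory sizes below are hypothetical), a dual-instance server can be thought of as two logical nodes, each owning its own SSDs and CPU cores plus a share of the server's RAM and NVRAM:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LogicalNode:
    """One logical node's exclusive slice of a storage server's resources."""
    ssds: List[str]       # SSDs owned exclusively by this node
    cpu_cores: List[int]  # CPU cores owned exclusively by this node
    ram_gib: int          # this node's portion of the server's RAM
    nvram_gib: int        # this node's portion of the server's NVRAM

# Hypothetical dual-instance split of one storage server; a single Node Manager
# service on the server controls both logical nodes.
node0 = LogicalNode(["nvme0n1", "nvme1n1"], list(range(0, 16)), 128, 8)
node1 = LogicalNode(["nvme2n1", "nvme3n1"], list(range(16, 32)), 128, 8)
```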
Failure Domains
Users define failure domains (FDs) based on the data center topology and the level of protection they want to achieve. Each server in the cluster can be assigned to a set of FDs.
The definition of an FD is expressed by assigning FD labels to the storage nodes; a single label or multiple labels can be assigned to every node. The system stores different replicas of the data on separate FDs, keeping the data protected from failures that affect an entire domain.
One example of an FD definition separates racks of servers by FD labels. All servers in the same rack are assigned the same FD label, while servers in different racks are assigned distinct labels (e.g., FD label = rack ID). As a result, two replicas of the same volume are never placed on two nodes in the same rack.
Another example of an FD definition is a grid topology, in which every node is assigned both a row label and a column label. In this case, replicas of a volume are never stored on two servers located in the same row or the same column.
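In both examples the placement rule reduces to the same check: two replicas of the same volume must not be placed on nodes that share any FD label. The following Python sketch illustrates the idea only; it is not the Lightbits API or placement code, and the node and label names are made up.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Node:
    """A storage node and the user-assigned FD labels it carries."""
    name: str
    fd_labels: frozenset

def valid_placement(nodes):
    """True if no two candidate nodes share any FD label."""
    return all(not (a.fd_labels & b.fd_labels)
               for a, b in combinations(nodes, 2))

# Rack topology: FD label = rack ID, so replicas land in different racks.
print(valid_placement([Node("node-1", frozenset({"rack-1"})),
                       Node("node-2", frozenset({"rack-2"}))]))   # True

# Grid topology: each node carries a row label and a column label.
print(valid_placement([Node("node-3", frozenset({"row-1", "col-1"})),
                       Node("node-4", frozenset({"row-1", "col-2"}))]))  # False: same row
```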
As described in the previous section, servers can be configured with a single or dual logical node (instance). The same failure domain rules apply to dual-instance deployments, with one additional constraint: two replicas of the same volume are never placed on the two logical nodes of the same server, because a server failure usually affects both of its nodes.
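Continuing the sketch above, one way to capture this extra constraint is to treat the physical server itself as an implicit FD label on each of its logical nodes (again, purely illustrative):

```python
# Dual-instance server: both logical nodes carry the server's own implicit label,
# so they can never hold two replicas of the same volume.
node_a0 = Node("node-5", frozenset({"rack-3", "server-a"}))
node_a1 = Node("node-6", frozenset({"rack-3", "server-a"}))
print(valid_placement([node_a0, node_a1]))   # False: same server (and same rack)
```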
For more information on Failure Domain configuration, see the Lightbits Administration Guide.