Failure Domains
A failure domain (FD) in a data center is a group of resources that are likely to fail together. For example, all of the servers in a rack might be considered a failure domain, because if there is a power outage or other failure that affects the rack, all of the servers in the rack will be affected as well.
FDs are important for designing and managing data centers because they help ensure that even if there is a failure, some of the resources in the data center will still be available. For example, if a rack fails, the data center can still function if the other racks are still available.
Another example of an FD definition is a grid topology, in which every node is assigned a row label and a column label. In this case, replicas of the same volume are never stored on two servers that share a row or a column.
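The grid rule can be sketched in a few lines of Python. This is an illustration only: the `row:`/`col:` label format and the function names are assumptions made for the example, not Lightbits identifiers.

```python
def grid_labels(row: int, col: int) -> frozenset[str]:
    """Return the two FD labels of a node at (row, col) in the grid."""
    return frozenset({f"row:{row}", f"col:{col}"})


def can_share_volume(a: frozenset[str], b: frozenset[str]) -> bool:
    """Two nodes may hold replicas of the same volume only if they share no FD label."""
    return a.isdisjoint(b)


node_a = grid_labels(0, 0)
node_b = grid_labels(0, 1)  # same row as node_a
node_c = grid_labels(1, 1)  # different row and column from node_a

print(can_share_volume(node_a, node_b))  # False: both sit in row 0
print(can_share_volume(node_a, node_c))  # True: no shared row or column
```

Representing each node's FDs as a set reduces the placement rule to a set-disjointness test, which also works for topologies with more than two labels per node.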
Users define FDs based on the data center topology and the level of protection they require. Each server in the cluster can be assigned to a set of FDs.
There are a number of factors that can be used to define failure domains, such as:
- Physical location: Resources that are located in the same physical location are more likely to fail together, such as all of the servers in a rack.
- Infrastructure: Resources that share the same infrastructure are more likely to fail together, such as all of the servers that are connected to the same power supply.
- Logical dependencies: Resources that are logically dependent on each other are more likely to fail together, such as a database server and its associated storage.
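As a rough sketch of how such factors can turn into FD labels, the Python below derives labels from a hypothetical server inventory and groups servers by the labels they share. The `Server` fields, the label format, and the helper names are all assumptions for illustration. Only the physical-location and infrastructure factors are modeled here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    rack: str        # physical location
    power_feed: str  # shared infrastructure


def fd_labels(server: Server) -> set[str]:
    """Derive one FD label per factor that the server shares with others."""
    return {f"rack:{server.rack}", f"power:{server.power_feed}"}


def servers_by_fd(servers: list[Server]) -> dict[str, set[str]]:
    """Map each FD label to the names of the servers that would fail with it."""
    groups: dict[str, set[str]] = defaultdict(set)
    for server in servers:
        for label in fd_labels(server):
            groups[label].add(server.name)
    return dict(groups)


servers = [
    Server("s1", rack="r1", power_feed="A"),
    Server("s2", rack="r1", power_feed="B"),
    Server("s3", rack="r2", power_feed="A"),
]
groups = servers_by_fd(servers)
print(sorted(groups["rack:r1"]))  # ['s1', 's2']
print(sorted(groups["power:A"]))  # ['s1', 's3']
```

In this example s1 and s3 sit in different racks but still share a power feed, so labeling by rack alone would miss a way in which they can fail together.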
Failure Domains in Lightbits
A Lightbits storage cluster is composed of a distributed set of servers. Each server can be placed in a different physical or logical group.
During installation, Lightbits servers can be labeled with a list of FDs. A minimum of three servers with non-intersecting FDs is required for normal operation. The labels should correspond to the physical and/or logical failure domains in which the servers are placed.
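The three-server requirement can be checked before installation with a small script. The following is a minimal sketch, not a Lightbits tool: the function name and the dictionary layout are assumptions, and the brute-force search is only practical for small clusters.

```python
from itertools import combinations


def has_disjoint_servers(fds_by_server: dict[str, set[str]], required: int = 3) -> bool:
    """Return True if some `required` servers have pairwise non-intersecting FD sets."""
    for group in combinations(fds_by_server.values(), required):
        if all(a.isdisjoint(b) for a, b in combinations(group, 2)):
            return True
    return False


three_racks = {"s1": {"rack1"}, "s2": {"rack2"}, "s3": {"rack3"}}
two_racks = {"s1": {"rack1"}, "s2": {"rack1"}, "s3": {"rack2"}, "s4": {"rack2"}}

print(has_disjoint_servers(three_racks))  # True
print(has_disjoint_servers(two_racks))    # False: only two distinct racks
```

Note that this sketch treats a server with no labels as disjoint from every other server, which may or may not match how a real cluster handles unlabeled servers.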
The replication logic of a Lightbits storage cluster uses this information to guarantee that the replicas of a volume's data are placed on servers that do not share a failure domain. A failure in a single failure domain (e.g., the power supply of a specific rack) therefore impacts only a single replica, and the cluster maintains the availability of the data.
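The guarantee can be illustrated by counting how many replicas of one volume are lost when a given FD fails. This is a sketch under assumed names (`replicas_hit`, the label strings, and the placement list are hypothetical), not the actual replication logic.

```python
def replicas_hit(failed_fd: str, placement: list[str], fds_by_server: dict[str, set[str]]) -> int:
    """Count the replicas of one volume that are lost when `failed_fd` goes down."""
    return sum(1 for server in placement if failed_fd in fds_by_server[server])


fds_by_server = {
    "s1": {"rack1", "pdu-a"},
    "s2": {"rack2", "pdu-b"},
    "s3": {"rack3", "pdu-c"},
}
placement = ["s1", "s2", "s3"]  # one replica per server, no shared FDs

# Fail every FD in turn and record the worst case for this volume.
all_fds = set().union(*fds_by_server.values())
worst_case = max(replicas_hit(fd, placement, fds_by_server) for fd in all_fds)
print(worst_case)  # 1
```

If two of the servers in the placement shared a label, for example the same PDU, the worst case would rise to 2, which is exactly the situation the non-intersecting FD rule is meant to exclude.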
Per the previous section, servers can be configured using a single or dual instance. The same FD rules apply to dual-instance servers, because both instances share a physical resource (the same server). Multiple replicas of a volume are never placed on the same server, because a server failure usually affects both instances (nodes).
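One way to model the dual-instance rule is to treat the physical server as an implicit FD that every instance on it inherits. The sketch below assumes this model for illustration; the `Node` class, the `server:` label prefix, and `effective_fds` are hypothetical and do not describe how Lightbits implements the rule internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    server: str                           # physical server hosting this instance
    labels: frozenset[str] = frozenset()  # user-assigned FD labels


def effective_fds(node: Node) -> frozenset[str]:
    """User labels plus an implicit FD for the hosting server (assumed model)."""
    return node.labels | {f"server:{node.server}"}


def can_share_volume(a: Node, b: Node) -> bool:
    """Two nodes may hold replicas of one volume only if their FDs do not intersect."""
    return effective_fds(a).isdisjoint(effective_fds(b))


n0 = Node("s1-instance0", server="s1")
n1 = Node("s1-instance1", server="s1")  # second instance on the same server
n2 = Node("s2-instance0", server="s2")

print(can_share_volume(n0, n1))  # False: both instances run on server s1
print(can_share_volume(n0, n2))  # True: different servers, no shared labels
```

With this model, the two instances of a server conflict even when the user assigns them no labels at all, so the single-server case needs no special handling.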